Closed: nbuton closed this issue 3 years ago
Dear Nico,
thank you for pointing out this issue. Your version of the call in create_datasets.sh
is indeed correct; I will fix the script at some point.
As described in our paper, we intentionally use two different Swiss-Prot versions for EC50 and EC40 to mimic datasets from the literature. You should be able to create the datasets yourself (the corresponding pkl files should be created after the first run). But as this whole process is somewhat annoying, I will try to deposit the processed files for level 2 via our datacloud. Ping me again in case I forget to do it towards the end of this week. Best, Nils
I just realized that the links to our nextcloud instance have changed slightly. The pkl files and pretrained models should be accessible again now. Given these pickle files, the dataset construction is completely deterministic, i.e. you should end up with exactly the datasets we used in our experiments.
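For anyone who wants to check the determinism claim on their own machine, here is a minimal sketch that hashes every file in an output folder so that two runs can be compared. The helper names (sha256_of_file, folder_digests) are hypothetical and not part of the UDSMProt code. Byte-identical files are a stricter condition than identical dataset contents, so a mismatch here does not necessarily mean the splits differ.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def folder_digests(folder):
    """Map each file's path (relative to folder) to its SHA-256 digest."""
    root = Path(folder)
    return {
        str(p.relative_to(root)): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Comparing folder_digests(run_a) == folder_digests(run_b) for two working folders produced by separate runs gives a quick yes/no answer; the differing keys show which files to inspect.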
Thanks a lot for your help. I successfully ran create_datasets.sh after downloading the two versions of Swiss-Prot.
Hello, I am trying to replicate the same datasets for EC prediction (EC40 and EC50) as in your UDSMProt paper, but I have run into some difficulties.
First, in your script code/create_datasets.sh, line 27 reads:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
I think it should be:
python proteomics_preprocessing.py clas_ec --drop_ec7=True --working_folder=datasets/clas_ec/clas_ec_ec50_level2 --pretrained_folder=datasets/lm/lm_sprot_uniref --level=2 --include_NoEC=False --dataset="uniprot" --sampling_method_train=1 --sampling_method_valtest=3 --ignore_pretrained_clusters=True --sampling_ratio=[.8,.1,.1] --save_prev_ids=True
The working folder should be named clas_ec_ec50_level2, not clas_ec_ec50_level1.
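The mismatch described above (a working folder ending in level1 combined with --level=2) can be caught mechanically. The following is a rough sketch, assuming the folder name always ends in level followed by a number; the function level_mismatch is hypothetical and not part of the repository.

```python
import re


def level_mismatch(command):
    """Return (folder_level, flag_level) if they disagree, otherwise None.

    Assumes the --working_folder value ends in '_level<N>' and that the
    level is passed as '--level=<N>'; both are taken from the commands
    quoted in this thread.
    """
    folder = re.search(r"--working_folder=\S*_level(\d+)", command)
    flag = re.search(r"--level=(\d+)", command)
    if folder is None or flag is None:
        return None  # nothing to compare
    folder_level, flag_level = int(folder.group(1)), int(flag.group(1))
    return None if folder_level == flag_level else (folder_level, flag_level)
```

Running this over each line of create_datasets.sh that starts with python proteomics_preprocessing.py would flag line 27 and leave the corrected command alone.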
Secondly, when I run the script create_datasets.sh, I get an error saying: ../tmp_data/cdhit04_uniprot_sprot_2017_03.pkl not found. I think this may be because the link you provide contains two files built from two different versions of Swiss-Prot: the file "cdhit04_uniprot_sprot_2016_07.pkl", which uses the 07/2016 version, and the file "uniref50_2017_03_uniprot_sprot_2017_03.pkl", which uses the 03/2017 version.
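To see at a glance which of these pickle files are present before starting a long run, a small check along the following lines may help. It is only a sketch: the file list is copied from the names mentioned in this thread, and whether create_datasets.sh needs exactly these files is an assumption. The helper missing_pkl_files is hypothetical.

```python
from pathlib import Path

# File names as quoted in this thread. Assumption: these are the cluster
# files the preprocessing expects to find in ../tmp_data.
EXPECTED_PKL_FILES = [
    "cdhit04_uniprot_sprot_2016_07.pkl",
    "uniref50_2017_03_uniprot_sprot_2017_03.pkl",
]


def missing_pkl_files(tmp_data_dir, expected=EXPECTED_PKL_FILES):
    """Return the expected file names that do not exist in tmp_data_dir."""
    folder = Path(tmp_data_dir)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    for name in missing_pkl_files("../tmp_data"):
        print(f"missing: {name}")
```

An empty result means both files named above are in place; it does not guarantee that the script will not ask for a further file, such as the cdhit04_uniprot_sprot_2017_03.pkl from the error message.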
So I am a little confused: I don't know whether I have to download the 03/2017 or the 07/2016 release of Swiss-Prot, together with the files in your link, in order to replicate exactly the same dataset as in your UDSMProt paper. And even if I have the correct cdhit04 file, will I end up with exactly the same test dataset? If not, it would be kind if you could provide a link to download the exact train/dev/test dataset for ECpred.
Thanks a lot in advance.