tttianhao / CLEAN

CLEAN: a contrastive learning model for high-quality functional prediction of proteins
MIT License

offline tool issue with the training #40

Closed Bio-finder closed 9 months ago

Bio-finder commented 11 months ago

Hello, I am currently trying to run your tool with the Docker container and I am facing an issue at the training step. Here is the command line that I use:

    /shared/projects/seabioz/softwares/CLEAN/clean-1.0.1.sif python ./scripts/train-supconH.py --training_data split100 --model_name split100_supconH --epoch 4100 --n_pos 9 --n_neg 30 -T 0.1

And here is the error I get:

    Traceback (most recent call last):
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 139, in <module>
        main()
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 118, in main
        train_loss = train(model, args, epoch, train_loader,
      File "/shared/projects/seabioz/softwares/CLEAN/./scripts/train-supconH.py", line 50, in train
        for batch, data in enumerate(train_loader):
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
        data = self._next_data()
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
        data = self._dataset_fetcher.fetch(index) # may raise StopIteration
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/usr/local/lib/python3.10/site-packages/CLEAN-0.1-py3.10.egg/CLEAN/dataloader.py", line 105, in __getitem__
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 791, in load
        with _open_file_like(f, 'rb') as opened_file:
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 271, in _open_file_like
        return _open_file(name_or_buffer, mode)
      File "/usr/local/lib/python3.10/site-packages/torch/serialization.py", line 252, in __init__
        super().__init__(open(name, mode))
    FileNotFoundError: [Errno 2] No such file or directory: './data/esm_data/P32143_6.pt'

It looks like a .pt file is missing, but I downloaded them with your scripts, so I don't understand what could be missing.

zhangjun19thu commented 9 months ago

The name P32143_6.pt was changed from P32143. I got this error too. In version 1.0.1, utils.py does not have the function "mutate_single_seq_ECs"; I guess the error is due to this? Have you fixed this error?

canallee commented 9 months ago

Yes, if an EC number has only one sequence (likely the case for P32143 here), that sequence needs to be "mutated" (randomly masked from ESM-1b's view) so that it is possible to sample a positive for the anchor. Please make sure that you have followed the instructions for the step that involves mutate_single_seq_ECs(train_file). Let us know if there are any further questions.
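
For readers hitting the same FileNotFoundError, the check below is a minimal diagnostic sketch, not part of CLEAN. It lists EC numbers that have exactly one sequence and tests whether any suffixed embedding file (such as P32143_6.pt in the traceback above) exists for them. The helper names `single_sequence_ecs` and `has_mutated_embedding` are hypothetical, and the table layout (tab-separated, with `Entry` and `EC number` columns and `;` between multiple EC numbers) is an assumption that should be checked against your own training file.

```python
import csv
import io
from collections import defaultdict
from pathlib import Path


def single_sequence_ecs(tsv_text):
    """Map each EC number that has exactly one sequence to its entry ID.

    Assumption: tab-separated text with 'Entry' and 'EC number' columns,
    and ';' separating multiple EC numbers on one row.
    """
    entries_by_ec = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for ec in row["EC number"].split(";"):
            entries_by_ec[ec.strip()].add(row["Entry"])
    # Keep only the EC numbers backed by a single entry.
    return {
        ec: next(iter(ids))
        for ec, ids in entries_by_ec.items()
        if len(ids) == 1
    }


def has_mutated_embedding(entry_id, esm_dir):
    """Return True if at least one '<entry_id>_<suffix>.pt' file exists."""
    return any(Path(esm_dir).glob(f"{entry_id}_*.pt"))


if __name__ == "__main__":
    # Toy table with made-up IDs; replace with the contents of your own file.
    toy = (
        "Entry\tEC number\tSequence\n"
        "A1\t1.1.1.1\tMK\n"
        "A2\t1.1.1.1;2.2.2.2\tMR\n"
    )
    for ec, entry in single_sequence_ecs(toy).items():
        found = has_mutated_embedding(entry, "./data/esm_data")
        print(ec, entry, "mutated embedding found" if found else "MISSING")
```

If an entry is reported as missing, the mutation step described above has probably not been run (or its output has not been embedded) for that training file.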