leaemiliepradier / PlasForest

A random forest classifier to identify contigs of plasmid origin in contig and scaffold genomes
GNU General Public License v3.0
Do you know what the problem is when downloading the database? #13

Closed: Wanli-HE closed this issue 1 year ago

Wanli-HE commented 1 year ago

/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py:903: UserWarning: Failed to save epost.dtd at /usr/local/home/hsv709/.config/biopython/Bio/Entrez/DTDs/epost.dtd
  warnings.warn("Failed to save %s at %s" % (filename, path))
Traceback (most recent call last):
  File "/mibi/users/Wanli/test_plasplinev1.4.1/Plaspline/db/db/plasforest/check_and_download_database.py", line 95, in <module>
    download_missing(list_missing, email)
  File "/mibi/users/Wanli/test_plasplinev1.4.1/Plaspline/db/db/plasforest/check_and_download_database.py", line 77, in download_missing
    result = Entrez.read(request)
  File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/__init__.py", line 508, in read
    record = handler.read(handle)
  File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 304, in read
    self.parser.ParseFile(handle)
  File "/home/conda/feedstock_root/build_artifacts/python-split_1653669926144/work/Modules/pyexpat.c", line 459, in EndElement
  File "/mibi/Wanli/anaconda/envs/plasplinev1.4.1/lib/python3.9/site-packages/Bio/Entrez/Parser.py", line 666, in endErrorElementHandler
    raise RuntimeError(value)
RuntimeError: Some IDs have invalid value and were omitted. Maximum ID value 18446744073709551615

Wanli-HE commented 1 year ago

Here is the command I ran: ./database_downloader.sh

tazziotissot commented 1 year ago

This error actually comes from the script check_and_download_database.py. NCBI Entrez often fails when many sequences are downloaded at once, and the previous version of the script did not try again after a failure. The new version of the script lets you define smaller batches when downloading the complementary sequences, and tries again for sequences that were not downloaded.
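For readers who want to see the idea in code, here is a minimal sketch of batched downloading with retries. It is not the code of check_and_download_database.py: the function names, the default batch size, the number of rounds, and the placeholder e-mail address are all assumptions made for illustration. It also assumes the requested IDs are versioned accessions that match the first word of each returned FASTA header.

```python
import time
from typing import Callable, Iterable, List, Set, Tuple


def chunked(ids: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def parse_fasta_ids(fasta_text: str) -> Set[str]:
    """Return the first word of every FASTA header line in `fasta_text`."""
    found = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            found.add(line[1:].split()[0])
    return found


def download_in_batches(
    ids: List[str],
    fetch: Callable[[List[str]], str],
    batch_size: int = 100,   # hypothetical default, not taken from PlasForest
    max_rounds: int = 3,     # hypothetical default, not taken from PlasForest
    pause: float = 0.0,      # seconds to wait between successful requests
) -> Tuple[str, List[str]]:
    """Fetch `ids` in small batches; retry whatever is still missing.

    Returns the concatenated FASTA text and the list of IDs never obtained.
    """
    missing = list(ids)
    chunks = []
    for _ in range(max_rounds):
        if not missing:
            break
        got: Set[str] = set()
        for batch in chunked(missing, batch_size):
            try:
                text = fetch(batch)
            except (OSError, RuntimeError):
                # A failed batch simply stays in `missing` for the next round.
                continue
            chunks.append(text)
            got |= parse_fasta_ids(text)
            time.sleep(pause)
        missing = [i for i in missing if i not in got]
    return "".join(chunks), missing


def entrez_fetch(batch: List[str]) -> str:
    """One possible `fetch` callable, using Biopython (must be installed)."""
    from Bio import Entrez  # imported lazily so the rest runs without Biopython

    Entrez.email = "you@example.org"  # placeholder: NCBI asks for a real address
    handle = Entrez.efetch(
        db="nucleotide", id=",".join(batch), rettype="fasta", retmode="text"
    )
    try:
        return handle.read()
    finally:
        handle.close()
```

A call such as `download_in_batches(list_of_accessions, entrez_fetch, batch_size=50, pause=0.5)` would then return the sequences it obtained together with the accessions that are still missing, so a later run can target only those.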

Wanli-HE commented 1 year ago

OK, so I need to re-run the download until it no longer raises any errors?

tazziotissot commented 1 year ago

If you have already run the script database_downloader.sh, then you should already have a few thousand sequences in the file plasmid_refseq.fasta. If so, you can directly run the command python3 check_and_download_database.py download. It will ask you a few questions to make sure all the sequences are downloaded. If you don't manage to get all the sequences, you should still get pretty good performance. If you have deleted the file plasmid_refseq.fasta, then you should probably run database_downloader.sh again.
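If you are not sure which case applies, a small helper like the one below can count the sequences already present in plasmid_refseq.fasta. This is only a sketch: the function names are hypothetical, it is not part of PlasForest, and it only distinguishes "file missing or empty" from "file contains sequences".

```python
import os


def count_fasta_records(path: str) -> int:
    """Count FASTA header lines (starting with '>'); 0 if the file is absent."""
    if not os.path.isfile(path):
        return 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def suggest_next_step(path: str = "plasmid_refseq.fasta") -> str:
    """Return the command from this thread that fits the current state."""
    if count_fasta_records(path) == 0:
        # No sequences yet (or the file was deleted): start from scratch.
        return "./database_downloader.sh"
    # Some sequences are present: only fetch the ones that are missing.
    return "python3 check_and_download_database.py download"


if __name__ == "__main__":
    n = count_fasta_records("plasmid_refseq.fasta")
    print("plasmid_refseq.fasta contains", n, "sequences")
    print("Suggested command:", suggest_next_step())
```

Run it from the directory that contains plasmid_refseq.fasta.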