Open bioinfoMMS opened 4 years ago
Hi Maddy,
thanks for notice this issue, about the headers, yep, your are right the headers must have an specific but simple format like: >gbk|xxxx|acc|xxxxx blablabla
or >acc|xxxx blablabla
that's reason of why in the case >acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014
shows an empty ti|
. so, to solve you can do this:
acc|
: sed "s/GENBANK\|//g" yourfasta.fasta
ti||acc|xxxxx
): sed "s/ti\|\|//g" yourfasta.fasta
the secret is maintain the format ncbiID|number
.
hope it can help
cheers
PD: try the new code, I already make changes to avoid the double ti|xxxx|ti|xxxxx
in the final fasta
Thank you for your quick response! I put all my headers into the format of >acc|xxxx blablabla but only the first entry in the multifasta file had its taxon id successfully fetched. The others all have the empty ti|| at the beginning. Any thoughts?
Your welcome, would you share some headers to reproduce the error?, or your entire fasta (or a half) would be perfect, I'm wondering if something goes wrong with the ID selection in the code. Also paste the line code you are using.
cheers
I attached a portion of my fasta file. I run pastetaxid with:
bash pasteTaxID.bash --multifasta viralseqs.txt --parallelJobs 8 --apikey myncbiapikey
I was able to get the example multifasta file to work, with all IDs added
Thanks Maddy, I found the bug and already fixed it, the problem was only with the parallel works that chrash when taxID is found and connection retries is still trying to fetch the ID. try using this fixed code: https://github.com/Sanrrone/pasteTaxID while I create the pull request to this repo.
also I attached the new fasta with the corresponding taxID. new_viralseqs.txt
let me know if it works to close the issue or continue improving the code.
cheers
The fix seems to be giving me the taxon ids now. Thank you!
However, I have noticed that the script sometimes get stuck getting IDs from NCBI (the output with --debug shows that its tries to fetch the ID from NCBI over and over again with no luck). It does not appear to get stuck on the file I sent you (though its been running over 40 mins now with 8 processors and is not done), but another set of sequences (attached) gets stuck. It sounds similar to the problem mladen5000 had in March. I am also submitting it as a remote job.
I checked some info about the parallel job for ncbi api key and I found they provide a maximum of 10 request per second (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
Now that I have a key, are there still access limits?
Yes. By default, your key will increase the limit to 10 requests/second for all activity from that key.
If we receive requests at a higher frequency that include the same key, all requests using that key will receive an error message.
If you need higher rates of access, please contact us and we can discuss your situation.
I checked this using parallel jobs at 8, and 20. Using 8 parallel process I have no error and the new fasta is fine, but using 20 (that is more than ncbi let you fetch), the script get stuck, the retries are forcing for a new connection because ncbi return an error or just "blank" result (the retry is just an implement to avoid errors like internet connection but I didn't expect an overrequest error).
so, the solution is just not let the parallel job be more than 10, if you have 50 cores, anyway set it to 10.
let me know if it's continues getting stuck using just 10 process.
@bioinfoMMS an update Maddy, the script sometimes reach more than one fetch per process, so try to set the jobs to 9 insteaf of 10, check this performance:
sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 10 --apikey mykey)
real 2m46.568s
user 1m8.963s
sys 0m26.808s
sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 9 --apikey mykey)
real 0m54.901s
user 0m17.986s
sys 0m7.231s
sandro@elitedesk1:$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 8 --apikey mykey)
real 0m55.204s
user 0m15.851s
sys 0m6.516s
cheers
I still had the problem with 8 processors. It still could be something due to NCBI's limit but it is strange that you are able to run it with no problems and really quickly with the same number of processors. It also seems to have gotten stuck for me on the first file I sent you as well :(
Ok so I ran it with no parallelization:
bash pasteTaxID.bash --multifasta test_viral_2.txt --debug
And it still gets stuck , the problematic taxon id was JF714137.1 if that helps. I have it as 'acc' even though it is a Genbank sequence. Would this cause a problem? Does it need to be labeled as 'gbk'?
mmm, a it's weird behaviour. I just already run the script with a fake fasta
>acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA
and works
>ti|1000308|acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA
I'm thinking about the internet, sometimes I remind stucks when it was slow or "intermittent", have you another connection? (maybe cellphone?), sorry I have no more ideas at the moment, it not seems a code problem :(.
It works well enough that I should be able to use your script, thank you again for all your help and quick replies! It is much appreciated!
your welcome :), feel free to ask any other question.
good luck with your work.
Hello,
I wonder if you could give me more information about the required format for the fasta headers. I have been running pasteTaxID and while I don't get any errors, the tax ids do not show up in the results.
This header:
'>acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
Comes out as:
'>ti||acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
I am guessing it may have something to do with the header format. I did try to remove the GENBANK part so the header was:
'>acc|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'
A few of the tax ids were found, but most were not. For example:
' >ti||acc|FJ640294.1|Uncultured_marine_virus_isolate_CBSM-188_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009' '>ti|186617|acc|FJ640295.1|Uncultured_marine_virus_isolate_CBSM-189_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'
Any help would be appreciated.
Thanks, Maddy