Required FASTA header format?

bioinfoMMS commented 4 years ago

Hello,

I wonder if you could give me more information about the required format for the fasta headers. I have been running pasteTaxID and while I don't get any errors, the tax ids do not show up in the results.

This header:

'>acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

Comes out as:

'>ti||acc|GENBANK|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

I am guessing it may have something to do with the header format. I did try to remove the GENBANK part so the header was:

'>acc|AB866984.1|Human_immunodeficiency_virus_1_gene_for_pol_protein,_partial_cds,_isolate:_F10-5112353-1.|Human_immunodeficiency_virus_1|VRL|25-JUL-2014'

A few of the tax ids were found, but most were not. For example:

' >ti||acc|FJ640294.1|Uncultured_marine_virus_isolate_CBSM-188_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009' '>ti|186617|acc|FJ640295.1|Uncultured_marine_virus_isolate_CBSM-189_genomic_sequence.|uncultured_marine_virus|ENV|07-APR-2009'

Any help would be appreciated.

Thanks, Maddy

Sanrrone commented 4 years ago

remove the GENBANK after acc|: sed "s/GENBANK\|//g" yourfasta.fasta
remove "empty" entries (like ti||acc|xxxxx): sed "s/ti\|\|//g" yourfasta.fasta

the secret is maintain the format ncbiID|number.

hope it can help

cheers

PD: try the new code, I already make changes to avoid the double ti|xxxx|ti|xxxxx in the final fasta

bioinfoMMS commented 4 years ago

Thank you for your quick response! I put all my headers into the format of >acc|xxxx blablabla but only the first entry in the multifasta file had its taxon id successfully fetched. The others all have the empty ti|| at the beginning. Any thoughts?

Sanrrone commented 4 years ago

Your welcome, would you share some headers to reproduce the error?, or your entire fasta (or a half) would be perfect, I'm wondering if something goes wrong with the ID selection in the code. Also paste the line code you are using.

cheers

bioinfoMMS commented 4 years ago

I attached a portion of my fasta file. I run pastetaxid with:

bash pasteTaxID.bash --multifasta viralseqs.txt --parallelJobs 8 --apikey myncbiapikey

I was able to get the example multifasta file to work, with all IDs added

viralseqs.txt

Sanrrone commented 4 years ago

Thanks Maddy, I found the bug and already fixed it, the problem was only with the parallel works that chrash when taxID is found and connection retries is still trying to fetch the ID. try using this fixed code: https://github.com/Sanrrone/pasteTaxID while I create the pull request to this repo.

also I attached the new fasta with the corresponding taxID. new_viralseqs.txt

let me know if it works to close the issue or continue improving the code.

cheers

bioinfoMMS commented 4 years ago

The fix seems to be giving me the taxon ids now. Thank you!

However, I have noticed that the script sometimes get stuck getting IDs from NCBI (the output with --debug shows that its tries to fetch the ID from NCBI over and over again with no luck). It does not appear to get stuck on the file I sent you (though its been running over 40 mins now with 8 processors and is not done), but another set of sequences (attached) gets stuck. It sounds similar to the problem mladen5000 had in March. I am also submitting it as a remote job.

test_viral_2.txt

Sanrrone commented 4 years ago

I checked some info about the parallel job for ncbi api key and I found they provide a maximum of 10 request per second (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)

Now that I have a key, are there still access limits?

Yes. By default, your key will increase the limit to 10 requests/second for all activity from that key. 
If we receive requests at a higher frequency that include the same key, all requests using that key will receive an error message. 
If you need higher rates of access, please contact us and we can discuss your situation.

I checked this using parallel jobs at 8, and 20. Using 8 parallel process I have no error and the new fasta is fine, but using 20 (that is more than ncbi let you fetch), the script get stuck, the retries are forcing for a new connection because ncbi return an error or just "blank" result (the retry is just an implement to avoid errors like internet connection but I didn't expect an overrequest error).

so, the solution is just not let the parallel job be more than 10, if you have 50 cores, anyway set it to 10.

let me know if it's continues getting stuck using just 10 process.

Sanrrone commented 4 years ago

@bioinfoMMS an update Maddy, the script sometimes reach more than one fetch per process, so try to set the jobs to 9 insteaf of 10, check this performance:

sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 10 --apikey mykey)

real    2m46.568s
user    1m8.963s
sys 0m26.808s
sandro@elitedesk1$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 9 --apikey mykey)

real    0m54.901s
user    0m17.986s
sys 0m7.231s
sandro@elitedesk1:$ time $(bash pasteTaxID.bash --multifasta test_viral_2.txt --parallelJobs 8 --apikey mykey)

real    0m55.204s
user    0m15.851s
sys 0m6.516s

cheers

bioinfoMMS commented 4 years ago

I still had the problem with 8 processors. It still could be something due to NCBI's limit but it is strange that you are able to run it with no problems and really quickly with the same number of processors. It also seems to have gotten stuck for me on the first file I sent you as well :(

bioinfoMMS commented 4 years ago

Ok so I ran it with no parallelization:

bash pasteTaxID.bash --multifasta test_viral_2.txt --debug

And it still gets stuck , the problematic taxon id was JF714137.1 if that helps. I have it as 'acc' even though it is a Genbank sequence. Would this cause a problem? Does it need to be labeled as 'gbk'?

Sanrrone commented 4 years ago

mmm, a it's weird behaviour. I just already run the script with a fake fasta

>acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA

and works

>ti|1000308|acc|JF714137.1 fakeID
CGCAGTCAGTCAGTCACGTACAGTCACGATCA

I'm thinking about the internet, sometimes I remind stucks when it was slow or "intermittent", have you another connection? (maybe cellphone?), sorry I have no more ideas at the moment, it not seems a code problem :(.

bioinfoMMS commented 4 years ago

It works well enough that I should be able to use your script, thank you again for all your help and quick replies! It is much appreciated!

Sanrrone commented 4 years ago

your welcome :), feel free to ask any other question.

good luck with your work.

microgenomics / pasteTaxID

Required FASTA header format? #10