microgenomics / pasteTaxID

This script takes your FASTA files, searches for common IDs (ti, gi, gb, emb), fetches the ti (or the gi if the ti is missing), and finally writes the IDs into the same FASTA file
GNU General Public License v2.0

nt -> nt + ti, cannot open 17434647.fasta #6

Open bashirhamidi opened 5 years ago

bashirhamidi commented 5 years ago

hpc@hpc:/media/box1/tb/ncbint$ bash /home/box1/Downloads/pathoscope2/pasteTaxID/pasteTaxID.bash --multifasta nt.fasta --parallelJobs 50

Please note that there is over 4 TB of free space available on the drive so it's not a space limitation.

Sanrrone commented 5 years ago

Hi! Sorry for the delay. To test something in the script, could you split nt.fasta into two (or four) new files and run the script on one of them? I suspect the large number of FASTA records is causing an I/O error.

Could you also add the --debug parameter and paste the lines you get when the error appears?

Best Sandro
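One way to do the split suggested above is sketched here. It is a minimal example, not part of pasteTaxID: the function name `split_fasta`, the output naming scheme, and the demo sequences are all assumptions made for illustration. It distributes records round-robin and only switches output file on a header line, so no record is cut in half.

```shell
#!/usr/bin/env bash
set -euo pipefail

# split_fasta INPUT N PREFIX (hypothetical helper): distribute FASTA records
# round-robin into N files named PREFIX_1.fasta ... PREFIX_N.fasta.
# A new output file is chosen only when a header line (">") is seen,
# so sequence lines always stay with their own header.
split_fasta() {
    local input=$1 parts=$2 prefix=$3
    awk -v n="$parts" -v prefix="$prefix" '
        /^>/ { out = sprintf("%s_%d.fasta", prefix, (count++ % n) + 1) }
        out != "" { print > out }
    ' "$input"
}

# Tiny demo input with made-up sequences, written to a temporary directory.
demo_dir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$demo_dir/demo.fasta"

split_fasta "$demo_dir/demo.fasta" 2 "$demo_dir/part"

# Show how many records ended up in each part.
grep -c '^>' "$demo_dir"/part_*.fasta
```

For the real file the call would look like `split_fasta nt.fasta 4 nt_part`, after which the script can be tested on a single part.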

bashirhamidi commented 5 years ago

I split it into 6 files and it crashes the server, perhaps due to the I/O error? I'm now splitting it further into 5 GB files. Is there a way to stop the script from printing the individual tasks, such as the fetching?

Sanrrone commented 5 years ago

The messages are mandatory in that step; suppressing them could be a future improvement. When you mention a server, are you logged in to a cluster? Could you try running the script locally and, if the problem continues, give me the nt.fasta link so I can reproduce the error and look more deeply at what happens?

Best Sandro
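Until the script offers a quiet mode, ordinary shell redirection can keep the messages off the terminal without changing the script. The sketch below uses a hypothetical stand-in function, `noisy_step`, in place of the real tool; it only illustrates the redirection and makes no claim about how pasteTaxID prints its messages.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for a chatty command: progress goes to stdout,
# diagnostics go to stderr.
noisy_step() {
    echo "fetching 1.fasta"
    echo "warning: slow response" >&2
    echo "fetching 2.fasta"
}

log_file=$(mktemp)

# Send both stdout and stderr to the log; nothing reaches the terminal.
noisy_step > "$log_file" 2>&1

logged_lines=$(wc -l < "$log_file" | tr -d ' ')
echo "lines captured in log: $logged_lines"
```

Applied to the command from the first post, this would mean appending `> run.log 2>&1` to the invocation, where `run.log` is an arbitrary file name. The log stays available for debugging if the run fails.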

bashirhamidi commented 5 years ago

Thanks for the response. The database is downloaded directly from NCBI's ftp server. Edit: For some reason the Markdown is not handling the link properly. Here's the ftp link ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz

In response to your other question, yes, I am on a cluster. Per your recommendation, I ran it locally (first splitting the large file into smaller multifastas).

As part of the process, the script occasionally does not find parsefasta.awk (output below). Any idea why that might be?

++ awk '{print $2}'
++ echo 6283.fasta '>XR_003236166.1' PREDICTED: Vulpes vulpes uncharacterized LOC112925609 '(LOC112925609),' transcript variant X1, ncRNA
++ awk -v ID=emb -f parsefasta.awk
+ ti=9627
+ '[' 9627 == '' ']'
+ '[' 9627 '!=' '' ']'
+ echo '5824.fasta 9627'
+ fastaheader='>XR_003236166.1'
++ echo '>XR_003235922.1'
++ awk -v ID=ti -f parsefasta.awk
++ awk -v ID=ref -f parsefasta.awk
++ echo '>XM_026005165.1'
++ echo '>XM_026005517.1'
awk: cannot open parsefasta.awk (No such file or directory)
fetch.bash: line 105: newheader.txt: No such file or directory
++ awk -v ID=emb -f parsefasta.awk
+ ti=
awk: cannot open parsefasta.awk (No such file or directory)
++ echo '>XM_026003518.1'
++ awk -v ID=gi -f parsefasta.awk
fetch.bash: line 105: newheader.txt: No such file or directory
++ echo '>XR_003235453.1'
* Done :D
awk: cannot open parsefasta.awk (No such file or directory)
+ ref=
awk: cannot open parsefasta.awk (No such file or directory)
awk: cannot open parsefasta.awk (No such file or directory)
+ emb=
++ awk -v ID=ref -f parsefasta.awk
+ gi=
+ ti=
++ echo '>XR_003236166.1'
awk: cannot open parsefasta.awk (No such file or directory)
awk: cannot open parsefasta.awk (No such file or directory)

Sanrrone commented 5 years ago

Hi! It sounds like the script and the FASTA file are not in the same directory. Could you check that they are?
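The trace shows `awk -f parsefasta.awk` called with a bare file name, so awk looks for the program relative to the current working directory. The sketch below reproduces that behaviour in isolation. The directory layout and the file `helper.awk` are hypothetical, and whether pasteTaxID changes directory during its parallel jobs is not established in this thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical layout for the demo: a "tool" directory holding a helper awk
# program and a separate "data" directory holding a FASTA file.
demo=$(mktemp -d)
mkdir -p "$demo/tool" "$demo/data"
printf '{ print $1 }\n' > "$demo/tool/helper.awk"
printf '>XR_003236166.1 PREDICTED: Vulpes vulpes\n' > "$demo/data/one.fasta"

# 1) Relative lookup from the data directory fails: helper.awk is not there.
cd "$demo/data"
if awk -f helper.awk one.fasta 2>/dev/null; then
    relative_lookup=found
else
    relative_lookup=missing
fi

# 2) Referring to the helper by absolute path works from any directory.
first_field=$(awk -f "$demo/tool/helper.awk" one.fasta)

echo "relative lookup: $relative_lookup"
echo "first field: $first_field"
```

Keeping the script, its helper files, and the FASTA input in one directory and running from there avoids the relative lookup problem altogether.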