hallamlab / MetaPathways

A modular pipeline for constructing Pathway/Genome Databases from environmental sequence information
http://hallam.microbiology.ubc.ca/MetaPathways
12 stars 7 forks source link

refseq-names.txt file is missing #33

Closed xapple closed 7 years ago

xapple commented 11 years ago

As I tried to add the refseq database to the metapathways pipeline, I am getting an additional error which I am uncertain how to solve. Here is the transcript:

  ********************************************************** 
  **************** Running  MetaPathways ******************* 
  ********************************************************** 
              /bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/test-h-hyp-mek.fasta                       
  ********************************************************** 

[.... snip .....]

Issuing Command : /bubo/home/h3/lucass/proj38/nobackup/metapathways/libs/python_scripts/MetaPathways_parse_blast.py -d cog  -b /bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/output//test-h-hyp-mek/blast_results//test-h-hyp-mek.cog.blastout -m /bubo/home/h3/lucass/proj38/nobackup/metapathways/blastDB///COG_2013-02-05-names.txt  -r  /bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/output//test-h-hyp-mek/blast_results//test-h-hyp-mek.refscores  --min_bsr 0.400000  --min_score 20.000000 --min_length 60.000000 --max_evalue 0.000001 --algorithm BLAST

7. Parsing blast outputs for reference database - cog ......... Success!

Issuing Command : /bubo/home/h3/lucass/proj38/nobackup/metapathways/libs/python_scripts/MetaPathways_parse_blast.py -d refseq  -b /bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/output//test-h-hyp-mek/blast_results//test-h-hyp-mek.refseq.blastout -m /bubo/home/h3/lucass/proj38/nobackup/metapathways/blastDB///refseq_protein-names.txt  -r  /bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/output//test-h-hyp-mek/blast_results//test-h-hyp-mek.refscores  --min_bsr 0.400000  --min_score 20.000000 --min_length 60.000000 --max_evalue 0.000001 --algorithm BLAST

                                                  refseq ......... Error!
template_config.txt
/bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/config.txt
/bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/test-h-hyp-mek.fasta
/bubo/home/h3/lucass/HUMIC/processed_test/samples/test-h-hyp-mek/metapathways/output//test-h-hyp-mek

Error! : File /bubo/home/h3/lucass/proj38/nobackup/metapathways/blastDB///refseq_protein-names.txt seems to be empty!

Apparently the -names.txt file contains only the headers of the sequences present in a given database. However, the refseq database is provided directly in BLAST binary format from NCBI. We could use the fastacmd software to reverse this and generate the plaintext version. Subsequently extracting only the headers, but I'm not sure that would be a good way to proceed. The result would contain several annotations per sequence:

>gi|15674171|ref|NP_268346.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis Il1403] >gi|116513137|ref|YP_812044.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris SK11] >gi|125625229|ref|YP_001033712.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris MG1363] >gi|281492845|ref|YP_003354825.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis KF147] >gi|385831755|ref|YP_005869568.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis CV56] >gi|385839508|ref|YP_005877138.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris A76] >gi|389855617|ref|YP_006357861.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000] >gi|414075194|ref|YP_007000411.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris UC509.9] >gi|459286377|ref|YP_007509482.1| 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis IO-1] >gi|489223532|ref|WP_003131952.1| 30S ribosomal protein S18 [Lactococcus lactis]
MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ
N
>gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]
MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY
KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK
VQKLLNPDQ
>gi|66818355|ref|XP_642837.1| hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]
MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE
DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
IEKQKNHIVIGGSSLNSLPVSLPTVKSNESNESNTISINSENNNSKVSTDDTINNVM
>gi|30262378|ref|NP_844755.1| mbtH-like protein [Bacillus anthracis str. Ames] >gi|47527668|ref|YP_019017.1| balhimycin biosynthetic protein MbtH [Bacillus anthracis str. 'Ames Ancestor'] >gi|49185218|ref|YP_028470.1| balhimycin biosynthetic protein MbtH [Bacillus anthracis str. Sterne] >gi|49479960|ref|YP_036475.1| balhimycin biosynthetic protein MbtH [Bacillus thuringiensis serovar konkukian str. 97-27] >gi|118477774|ref|YP_894925.1| mbtH-like protein [Bacillus thuringiensis str. Al Hakam] >gi|218903507|ref|YP_002451341.1| mbtH-like protein [Bacillus cereus AH820] >gi|227814816|ref|YP_002814825.1| mbtH-like protein [Bacillus anthracis str. CDC 684] >gi|229603791|ref|YP_002866710.1| mbtH-like protein [Bacillus anthracis str. A0248] >gi|301053892|ref|YP_003792103.1| balhimycin biosynthetic protein MbtH [Bacillus cereus biovar anthracis str. CI] >gi|376266284|ref|YP_005118996.1| Polymyxin synthetase PmxB [Bacillus cereus F837/76] >gi|386736125|ref|YP_006209306.1| MbtH-like protein [Bacillus anthracis str. H9401] >gi|446106212|ref|WP_000184067.1| antibiotic transporter [Bacillus anthracis]
MTNPFENDNYTYKVLKNEEGQYSLWPAFLDVPIGWNVVHKEASRNDCLQYVENNWEDLNPKSNQVGKKILVGKR
nielshanson commented 11 years ago

Hey,

That's no good. So the -names.txt file is generated automatically when compiling blast./last databases from the original sequences in fasta format.

Are you compiling the database from the original sequence or using the precompiled database from the NCBI's ftp (ftp://ftp.ncbi.nlm.nih.gov/blast/db)? I'm not sure if a compiled database was dropped a precompiled database in there it would pickup the files. Since this is the main way people will get their sequences we should update to the code to regenerate the names.txt file if not present.

To get the pipeline to regenerate file you'll have to get the original sequence from that compiled refseq database. You can get the fasta sequences out of a compiled blast database with the blastdbcmd in the blast+ package. It would be something like:

$ blastdbcmd -db refseq -dbtype prot -outfmt %f -out refseq_sequences.fasta

You can put the resulting refseq.fasta command in the parameter file and run again. It should generate a -names.txt when formatting the database. This could take a while to extract all the sequences. Just watch your namespace when running the pipeline again to make sure that it picks up your newly extracted sequences and not the original compiled database.

Since doing the above is a bit of work, I'll provide you with my refseq names file to see if that fixes the problem in the interim.

Let me know how it goes.

xapple commented 11 years ago

I downloaded the files 1 to 8 from ftp://ftp.ncbi.nlm.nih.gov/blast/db/refseq_protein.0*.tar.gz as per the instructions on the documentation at "4. BLAST databases". You should probably update the doc to reflect the -names.txt file issue.

Then I used the fastacmd -D 1 -d refseq_protein -o plaintext.fasta command provided in the blast package (not +) to extract the sequences as there doesn't seem to be any corresponding file available on the NCBI FTP website in ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA. It produced the file I gave an excerpt of in my previous post.

If you think we just need to get the headers from that plaintext.fasta in the refseq_protein-names.txt file then a simple sed command should suffice. Or is the process more complicated ? Do you need my e-mail to send me your version of the file ?

nielshanson commented 11 years ago

Try your sed command to extract all the header lines from that fasta. If you have no luck, ping me again with your email and I'll get you my file. If all else fails point the config to the new plaintext.fasta file (may want to rename it to something more refseq-like) and MetaPathways will reformat and do the whole thing again.

I'm almost finished the new documentation. My apologies on the delay, there's a lot going on around here.

Niels

xapple commented 11 years ago

OK I will try that ! No prob, I know that a developers time is always one of the most required resources in this world : )

We will continue trying to get this working as it would be nice to be able to analyze our metagenomes samples using this pipeline. The next problem is probably going to be parallelizing the BLAST jobs as we don't have a SUN cluster but instead a SLURM one.

sunitj commented 10 years ago

Just wanted to mention here that i ran into an issue similar to xapple's and his fastacmd solution seems to have worked for me. I had to leave the refseq_protein fasta file in the blastdb.

hallamlab commented 7 years ago

Please use the newest version of MP on GitHub

hallamlab commented 7 years ago

Please use the newest version of MP on GitHub