marbl / metAMOS

A metagenomic and isolate assembly and analysis pipeline built with AMOS
http://marbl.github.io/metAMOS

ORFs sequences not found #136

Open hcecilia opened 10 years ago

hcecilia commented 10 years ago

Hello, I've been running some tests with 454 data. I'm running metAMOS with MetaVelvet and FragGeneScan, and stopping at the FindORFS step. I initially wanted to use MetaGeneMark, but it's not available for now. In the outputs of FindORFS there is no proba.orf.faa/fna. I guess I could write a script to create it from proba.ctg.fna and proba.orfs, but I don't understand the logic of these files. For instance, what do the nodes represent? And why are they split into several sequences?

NODE_112_length_527_cov_1.609108_17274-
GCGGAAGTGTTGCAAATTTGGGTGGATGGCAGTTGTACCGGCAATCAAAACCAGCCGGGCAAGATTGCGGGTCCTTTGCCGAAAGCCCGGTCAGCCGCTTATTTTCCCAGTCTTCAAATAGGCCGCAGTTGTGATTTACCGGGGGCAACACAAACCAACATTCGAGCTGAATTGTTTGCGGTCTTGTTAGCCTTTGCGGAATTAGAGCGAATGGGTATTCAAGCCGGCCATCTCGAATTTTTACCGATTGTT
NODE_112_length_527_cov_1.609108_280557-
CGATTGCCTTGCAAAAGTTACAAATCGAGACTCTCAAGTCACAAAATGTCCCAGCCGCCGCATAATGTCTCCGATCCGGAAGAAGTCGAAGTCGCTCAAGATTTGATTACAAAACTGGTCATCCGCGCCGACGTCGAGTTATCCGGCTTTCAACAACGCGTGGAATCCCGGCTCCAAGAACAAATCGAAGAATTGCGGACGGAAAACCGGATCTTGGCGGTTGGCGTTGGAATTGCCCTTTTACTGGGTCTTGTCCGGTGGGCTTCATGTCTT

Some nodes appear only once, but the sequence doesn't have the length written in the header:

NODE_2_length_63_cov_4.253968_193+
ATCATGGCAAATATGGGATTGATTTTACCTAAAAATAATTTCCTCTCTGAGGTCAGAAAGATCACAAAGGACAATGATATCCCTCTAATT
(this one is longer than 90 bp)

I hope I'm being clear.

skoren commented 10 years ago

The proba.orf.faa/fna files are only available in Postprocess/out once the pipeline completes. In FindORFS/out they are named just proba.faa/fna. (The files proba.faa/fna and proba.ctg.fna/faa should be the same unless you turn on ORF calling on sequences, which is off by default.)

You do not need to do any further parsing of the files. The names are generated by FragGeneScan and are constructed as _. In your case all the text "NODE_112_length_527_cov_1.609108" is output by the assembler and is the contig name. If a contig has more than one ORF, it will have multiple entries with the same starting contig name but at different positions within the contig. The sizes are output by FragGeneScan and should match the length of the sequences reported in the files.
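As a rough illustration of the naming described above, the following Python sketch groups ORF headers by the contig they come from. It assumes the contig names follow the Velvet-style pattern seen in this thread (NODE_<id>_length_<n>_cov_<coverage>) and treats everything after that prefix as the ORF suffix added by FragGeneScan; the function name `group_orfs_by_contig` is hypothetical and not part of metAMOS.

```python
import re
from collections import defaultdict

# Velvet-style contig name as it appears in this thread:
# NODE_<id>_length_<n>_cov_<coverage>. An optional leading ">" is tolerated.
CONTIG_RE = re.compile(r"^>?(NODE_\d+_length_\d+_cov_\d+(?:\.\d+)?)")


def group_orfs_by_contig(headers):
    """Map each contig name to the list of ORF headers that start with it."""
    groups = defaultdict(list)
    for header in headers:
        header = header.strip()
        match = CONTIG_RE.match(header)
        if match:  # headers that do not look like Velvet names are skipped
            groups[match.group(1)].append(header)
    return dict(groups)


# Headers copied from the thread.
headers = [
    "NODE_112_length_527_cov_1.609108_17274-",
    "NODE_112_length_527_cov_1.609108_280557-",
    "NODE_2_length_63_cov_4.253968_193+",
]

for contig, orfs in group_orfs_by_contig(headers).items():
    print(contig, len(orfs))
```

A contig with several ORFs, such as NODE_112 here, ends up with several entries under the same key.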

hcecilia commented 10 years ago

OK, but then how is it possible to find ORFs longer than the contig they come from? And why, in Assemble/out, do I find files with similar headers (proba.fna/metavelvet31.fna):

NODE_14_length_151_cov_1.000000_1181+
even though the ORFs haven't been searched for yet at this step of the pipeline? In this directory, contigs.fa/meta-velvetg.contigs.fa seem to contain the contig sequences, but the lengths in the headers don't match the sequences themselves. I'm concerned because I want to write a script that selects the contigs longer than 300, but it seems I can't trust the length written in the header, so I will have to count the bases of each sequence.

Don't hesitate to tell me if I'm asking inappropriate questions and should look for documentation elsewhere (the MetaVelvet manual or similar)!
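The script mentioned above, which selects contigs longer than 300 by counting bases rather than trusting the header, could look roughly like the following Python sketch. It assumes a standard FASTA file whose header lines start with ">"; the helper names `read_fasta` and `filter_by_length` are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines; join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=300):
    """Keep records whose counted sequence length is greater than min_length."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Small in-memory example; the records are made up for illustration.
example = [">short", "ACGT", "ACGT", ">long", "A" * 200, "C" * 150]
for header, seq in filter_by_length(read_fasta(example)):
    print(f">{header} counted_length={len(seq)}")
```

To run it on a real file, an open file handle can be passed instead of the list, for example `read_fasta(open("contigs.fa"))`.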

skoren commented 10 years ago

The fna file in the Assemble/out directory is also the same as FindORFS/proba.fna. The same file is hard-linked from several places within metAMOS's internal directories. The assembly with no ORF calling is in Assemble/out/proba.asm.contig.

As for the length of the gene call versus the contig, it is possible that one of the programs (FragGeneScan or MetaVelvet) is not reporting the size correctly. Have you checked the size of the entry in the fna file to see if it matches the Velvet length or the FragGeneScan length?
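One way to carry out that check is sketched below in Python: it reads the `length` field from a Velvet-style header and compares it with the number of bases actually present. The function name `length_mismatches` and the sample records are hypothetical. As a point to verify rather than a conclusion from this thread, the Velvet manual describes the length in contig names as counted in k-mers, so a constant offset of k - 1 across all contigs would be consistent with that convention.

```python
import re

# The "length" field of a Velvet-style name: NODE_<id>_length_<n>_cov_<coverage>
LENGTH_RE = re.compile(r"_length_(\d+)_")


def length_mismatches(records):
    """Return (header, header_length, counted_length, offset) where they differ."""
    mismatches = []
    for header, seq in records:
        match = LENGTH_RE.search(header)
        if match is None:
            continue  # no length field in this header
        header_length = int(match.group(1))
        counted = len(seq)
        if counted != header_length:
            mismatches.append((header, header_length, counted, counted - header_length))
    return mismatches


# Made-up records for illustration; real input would be (header, sequence)
# pairs parsed from the fna file.
records = [
    ("NODE_1_length_5_cov_1.000000", "ACGTACG"),
    ("NODE_2_length_4_cov_2.000000", "ACGT"),
]
for header, header_length, counted, offset in length_mismatches(records):
    print(header, header_length, counted, offset)
```

If the offset is the same for every contig, the header is probably using a different unit rather than being wrong; if it varies, one of the tools is more likely misreporting.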