mptrsen / Orthograph

Orthology prediction using a graph-based, reciprocal approach with profile hidden Markov models
GNU General Public License v3.0
32 stars 5 forks source link

Orthograph loads 0 COGs from a custom orthogroup description file #29

Closed claumer closed 6 years ago

claumer commented 6 years ago

Hi Orthograph developers,

I'm trying to get Orthograph to extract orthologous CDS from a target-capture dataset I've generated. I've found a reasonable approach to de novo assemble the captured regions and have verified in an informal way (blastx parsing) that many of the resulting contigs likely represent the exons I'm interested in, but now I want to extract these in a formal way, assign them to known orthologs, pad out missing CDS, etc. Orthograph and its integration of exonerate seems like it should do the trick nicely :-)

However, I'm having some trouble getting Orthograph to recognize my COGs. I thought it might improve my detection to use a custom ortholog set created in OMA with 6 representative species in the clade I'm interested in as input. I've generated a 3-column tab delimited file as discussed in the README, using a python script to parse and reformat one of the OMA output tables. Below, a "head" style glimpse:


OMA00001 Gapp_18771 Gapp OMA00001 Pvit_34512 Pvit OMA00001 Sell_9461 Sell OMA00006 Gapp_26049 Gapp OMA00006 Pvit_22501 Pvit OMA00006 Sell_16595 Sell OMA00007 Gapp_26826 Gapp OMA00007 Palp_1921 Palp OMA00007 ProI_11871 ProI OMA00007 Sell_13635 Sell OMA00008 Gapp_1573 Gapp OMA00008 Pvit_942 Pvit OMA00008 Sell_16001 Sell OMA00012 Gapp_12941 Gapp OMA00012 ProI_4428 ProI OMA00012 Pvit_5836 Pvit OMA00013 Gapp_17852 Gapp OMA00013 Pvit_18561 Pvit OMA00013 Sell_21850 Sell OMA00014 Gapp_20059 Gapp OMA00014 Pvit_30 Pvit OMA00014 Sell_12508 Sell OMA00016 Palp_2942 Palp OMA00016 Pvit_12758 Pvit OMA00016 Sell_10853 Sell OMA00019 Gapp_4780 Gapp OMA00019 ProI_16848 ProI OMA00019 Pvit_30487 Pvit OMA00019 Sell_11331 Sell OMA00020 Gapp_13835 Gapp OMA00020 ProI_14504 ProI OMA00020 Pvit_230 Pvit OMA00020 Sell_14264 Sell OMA00022 Gapp_323 Gapp OMA00022 Pvit_23482 Pvit OMA00022 Sell_19881 Sell OMA00024 Gapp_7166 Gapp OMA00024 Palp_6515 Palp OMA00024 Pvit_16580 Pvit OMA00024 Sell_7430 Sell OMA00024 Xeno_6999 Xeno OMA00029 Gapp_18714 Gapp OMA00029 Pvit_32658 Pvit OMA00029 Sell_19079 Sell OMA00031 Gapp_17954 Gapp OMA00031 Palp_5805 Palp OMA00031 Pvit_6849 Pvit OMA00031 Sell_13238 Sell OMA00031 Xeno_14650 Xeno OMA00032 Gapp_24533 Gapp OMA00032 ProI_10872 ProI OMA00032 Pvit_33107 Pvit OMA00032 Sell_19583 Sell OMA00038 Gapp_2830 Gapp OMA00038 Pvit_27666 Pvit OMA00038 Sell_7195 Sell OMA00041 Gapp_30514 Gapp OMA00041 ProI_17804 ProI OMA00041 Pvit_816 Pvit OMA00041 Sell_14471 Sell OMA00046 Gapp_24070 Gapp OMA00046 Pvit_27527 Pvit OMA00046 Sell_9342 Sell OMA00047 Gapp_17676 Gapp OMA00047 Pvit_34950 Pvit


Running the test suite works fine. My config file, which seems to be recognized by orthograph-manager, looks like this:


database-backend = sqlite

sqlite-database = Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite

sqlite-program = /usr/bin/sqlite3

input-file = /nfs/leia/research/marioni/claumer/TEFULL/Prorhynchus_cf_stagnalis_SA01_DNA106252_CGCTCATT-GGCTCTGA_out/contig.fa

species-name = Pro_cf_stag

ortholog-set = prorhynchida_oma

output-directory = Orthograph_prorhynchida_OMA/Pro_cf_stag

reference-taxa = Xeno,Gapp,Palp,ProI,Pvit,Sell


And loading the peptide OGS files into the database seems to work alright as well:

[claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/Xeno.pep --ogs-taxon-name Xeno --ogs-version 1 Using taxon name 'Xeno' Using OGS version '1' Data look OK, uploading...

15863 sequences for 'Xeno' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost. [claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/Gapp.pep --ogs-taxon-name Gapp --ogs-version 1 Using taxon name 'Gapp' Using OGS version '1' Data look OK, uploading...

31664 sequences for 'Gapp' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost. [claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/ProI.pep --ogs-taxon-name ProI --ogs-version 1 Using taxon name 'ProI' Using OGS version '1' Data look OK, uploading...

19797 sequences for 'ProI' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost. [claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/Palp.pep --ogs-taxon-name Palp --ogs-version 1 Using taxon name 'Palp' Using OGS version '1' Data look OK, uploading...

21147 sequences for 'Palp' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost. [claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/Pvit.pep --ogs-taxon-name Pvit --ogs-version 1 Using taxon name 'Pvit' Using OGS version '1' Data look OK, uploading...

40862 sequences for 'Pvit' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost. [claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager --load-ogs-peptide Orthograph_prorhynchida_OMA/Sell.pep --ogs-taxon-name Sell --ogs-version 1 Using taxon name 'Sell' Using OGS version '1' Data look OK, uploading...

22075 sequences for 'Sell' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on localhost.


However, when I load the 3-column ortholog file, giving the the same ortholog set name as in my .conf, something weird happens:

[claumer@hh-yoda-04-01 Orthograph]$ ./orthograph-manager Orthograph_min3_cogfile.txt Setting up core ortholog set from 'Orthograph_min3_cogfile.txt' in database 'Orthograph_prorhynchida_OMA/prorhynchida_tefull.sqlite' on host 'localhost'... (press CTRL+C to abort)

Enter the set name (required; ASCII only, no commas!): prorhynchida_oma

Enter a description for the set (optional but recommended). IMPORTANT: Do not use any commas (,): prorhynchida_oma

This is a list of present OGS in the database: [#] Taxon name version type sequences

[1] Xeno 1 aa 15863 [2] Gapp 1 aa 31664 [3] ProI 1 aa 19797 [4] Palp 1 aa 21147 [5] Pvit 1 aa 40862 [6] Sell 1 aa 22075

or 0 if you don't want to use this OGS: 1 or 0 if you don't want to use this OGS: 2 or 0 if you don't want to use this OGS: 3 or 0 if you don't want to use this OGS: 4 or 0 if you don't want to use this OGS: 5 or 0 if you don't want to use this OGS: 6 Done: 0 orthologous sequence relationships (0 COGs) for set prorhynchida_oma

Immediately after giving the set name/set description and hitting enter, the program seems to require some user prompting after printing "This is a list of present OGS in the database: etc". If I hit enter, it will spit out the name of one of the taxa in the set. If I type the number representing one or another of the input OGS taxa, it will return me to the prompt. If I type all numbers representing all species in sequence, it prints as shown above: "Done". However, it seems to have failed to load any COGs, even though I think my input COG description is formatted as you describe.

So my questions are;

1.) is this kind of behavior by orthograph-manager when inputting my COG description expected? If so, may I suggest you update the documentation a little to make this step a little clearer? It's not at all obvious what the user is expected to do when faced with a prompt that says "or 0 if you don't want to use this OGS:". I'm still not sure I'm driving it correctly.

2.) Secondly -- why is Orthograph failing to recognize any COGs? It is true that in my custom reference set, the number of taxa represented in each OMA ortholog varies (from 3-6). However, I don't see why this should be a problem in theory -- surely subclade-specific orthogroups should be allowed as reference OGs? Or are we really expected to have every species represented for every input OG to run this program? (this is seemingly what is implied by the comment in the README, "Does it contain the right number of lines (number of taxa * number of COGs)?")

Thanks very kindly for considering this -- look forward to hearing your thoughts.

Regards, Chris L

claumer commented 6 years ago

Hi Malte,

I just tried running Orthograph with a reduced version of my custom COG table where all taxa are present in each orthogroup. It's still giving me the same failure modality, apparently asking me to select taxa I want to include, and then outputting "Done: 0 orthologous sequence relationships (0 COGs) for set prorhynchida_oma".

So I think something is wrong with the parsing of my COG table. I can't see anything obvious that contrasts with your description of how to format this - no trailing whitespace, no column headers, sequence_IDs identical to fasta headers, only three columns separated by tabs. I can't tell if this is a bug or something I've simply misunderstood, but either way, I'd really appreciate your input.

Regards, Chris L

mptrsen commented 6 years ago

This is an old issue, I was able to resolve it with Chris via email. I just wanted to post my response here and close it.


I am wondering why that last step setting up the database is going wrong for you. Orthograph needs to connect the ortholog relations in the table to the sequences in the database, each of which is associated with one reference species. This is why Orthograph, after reading the table, will ask you for each species it found for the corresponding OGS in the database.

I am wondering about the section in the issue where you describe what you
see. Is this what you get verbatim or did the formatting swallow some
information?

    This is a list of present OGS in the database:
    [#] Taxon name version type sequences

    [1] Xeno 1 aa 15863
    [2] Gapp 1 aa 31664
    [3] ProI 1 aa 19797
    [4] Palp 1 aa 21147
    [5] Pvit 1 aa 40862
    [6] Sell 1 aa 22075

    or 0 if you don't want to use this OGS: 1
    or 0 if you don't want to use this OGS: 2
    or 0 if you don't want to use this OGS: 3
    or 0 if you don't want to use this OGS: 4
    or 0 if you don't want to use this OGS: 5
    or 0 if you don't want to use this OGS: 6
    Done: 0 orthologous sequence relationships (0 COGs) for set prorhynchida_oma

I am asking because it should look something like this:

    Enter the OGS ID for Xeno or 0 if you don't want to use this OGS: 1
    Enter the OGS ID for Gapp or 0 if you don't want to use this OGS: 2
    ...

So I wonder if something swallowed the first part of the prompt and if so, what did this. Since the prompt gets cut off right after the species shorthand, it looks like Windows line endings (CR/LF) might be the culprit. The table you attached has UNIX style line endings (LF). Could you please verify that this is also the case for the table you used?