Disagreement in node coloring for (some) identical sequences

matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees

https://matsengrp.github.io/gctree

GNU General Public License v3.0

16 stars 2 forks source link

Disagreement in node coloring for (some) identical sequences #120

Closed victor-ramos closed 9 months ago

victor-ramos commented 9 months ago

As illustrated in the picture below, the generated tree includes a node (seq2) with three identical sequences. While the deduplication output (abundances.csv) accurately tallies the sequences, the inference.fasta file contains an extra sequence for seq2 (abundance=4). Consequently, the tree anticipates four sequences, but we actually have three, leading to a discrepancy in how the node is colored compared to the information in our file.

This happened for a couple of clones only, in general when we have identical sequences, the nodes are colored accurately.

GCTree version 4.1.2 was used to generate the trees. Files necessary to reproduce the results, on Dropbox

willdumm commented 9 months ago

Hi Victor, thanks for reaching out about this!

Can you send the commands used to run gctree, including the deduplicate and mkconfig steps?

Also, I understand that the file clone_id_3.inference.1.fasta in the first screenshot is the output fasta containing the sequences in one of the output trees, is that correct? Is one of the trees in your screenshots the corresponding output tree? If not, could you also send the image of that corresponding tree?

victor-ramos commented 9 months ago

Dear Wiliam, thanks for taking a look into that.

Can you send the commands used to run gctree, including the deduplicate and mkconfig steps?

Sure, commands below:

deduplicate clone_id_3_ali.fasta --root GL --abundance_file clone_id_3_abundances.csv --idmapfile clone_id_3_idmap.txt > clone_id_3_dedup.phylip

mkconfig clone_id_3_dedup.phylip dnapars > dnapars.cfg dnapars < dnapars.cfg > {log}

gctree infer --root GL --idlabel --summarize_forest --colormapfile clone_id_3_seqnames_colorfile.tsv --idmapfile clone_id_3_idmap.txt --verbose --outbase {params.out_dir} outfile clone_id_3_abundances.csv > {log} 2>&1

I also uploaded all input/output files to Dropbox from the commands listed above.

Also, I understand that the file clone_id_3.inference.1.fasta in the first screenshot is the output fasta containing the sequences in one of the output trees, is that correct?

Yes, the file clone_id_3.inference.1.fasta contains the sequences in the only tree inferred by GCTree for clone_id_3.

Is one of the trees in your screenshots the corresponding output tree? If not, could you also send the image of that corresponding tree?

The second screenshot was just an example of another clone containing accurately-colored identical sequences.

I hope I understood your questions, let me know if anything is not clear or if you need anything else. Thanks.

willdumm commented 9 months ago

I had a bit of trouble recreating your issue.

After running the commands you sent, the gctree infer command gives the following error:

Traceback (most recent call last):
  File "/Users/willdumm/Downloads/hdagenv/bin/gctree", line 8, in <module>
    sys.exit(main())
  File "/Users/willdumm/Downloads/hdagenv/lib/python3.9/site-packages/gctree/cli.py", line 739, in main
    args.func(args)
  File "/Users/willdumm/Downloads/hdagenv/lib/python3.9/site-packages/gctree/cli.py", line 239, in infer
    seqid, color = line.rstrip().split("\t")
ValueError: too many values to unpack (expected 2)

This error comes from the format of the file clone_id_3_seqnames_colorfile.tsv, which has three columns instead of two.

GL              NA
seq1    &&P4_3_3_5_1160-2mRG&P4_3_3_5_1160-2mRK&        #D35000
seq2    &&P4_3_3_5_1171-2mRG&P4_3_3_5_1171-2mRK&        #D35000

I had to remind myself what the format of this file was supposed to be. Here's the relevant bit of the output from gctree infer --help:

  --colormapfile COLORMAPFILE
                        File containing color map in the tab-separated format: ``"SeqID color"`` (default: None)

After modifying the colormap file to this format, I was able to get output files that show your issue.

I'm not sure I understand the issue with coloring of nodes, but I do see that seq2 is being assigned an abundance of 4 in the output files, while it should only have abundance 3. Also, the leaf node seq1 is missing. Do you agree with this description of the issue? I'm working on figuring this out, and I'll reply again when I do. In the meantime, the corrected color map file is attached (note I changed the extension to csv only so I could upload to GitHub)

clone_id_3_seqnames_colorfile.csv

willdumm commented 9 months ago

I've figured out what's going on here. Seq1 has an ambiguous character (an N 23 bases before the end). As an experimental feature, gctree will disambiguate this site in its tree context using parsimony, and in this case that N is changed to a T, which makes the sequence identical to Seq2.

Therefore, it is correct behavior for the output tree to show abundance of 4 instead of 3, because the observed sequence which is named Seq1 is included in that node, too. As you pointed out, color mapping via the command line doesn't work correctly, because it uses a mapping that doesn't account for this collapse of Seq1 into the Seq2 node. This is a bug that should be fixed.

However, I would recommend that whenever possible you eliminate ambiguities from your input sequences, either by replacing them with a gap character or resolving them based on other sequences, if there is only one base observed at that site. In this case, every non-ambiguous sequence has a T at the relevant site, so the sequences containing N's should really just have that site changed to a T (Any parsimony-based disambiguation would do this, anyway).

Making this change to your input alignment, gctree seems to work as expected, and should color the node correctly.

Let me know if this resolves your issue!

willdumm commented 9 months ago

I just confirmed that for me, the issue is fixed by eliminating N's as described in my last reply, and providing a correctly formatted colormap file to gctree infer. For example,

GL  NA
seq1    #D35000:2,#E72802:1,#FACE1B:1
seq2    #D35000
seq3    #E72802
seq4    #E72802
seq5    #FACE1B
seq6    #FACE1B
seq7    #FACE1B
seq8    #FACE1B

Note that if you do something like

GL  NA
seq1    #D35000:1,#D35000:1,#E72802:1,#FACE1B:1
seq2    #D35000
seq3    #E72802
seq4    #E72802
seq5    #FACE1B
seq6    #FACE1B
seq7    #FACE1B
seq8    #FACE1B

the nodes will be colored incorrectly, since the colors are stored as keys in a dictionary, and the key #D35000 will have its value set to 1 twice.

In fact, using your original input files without replacing N's and using a correctly formatted colormap file that accounts for the fact that Seq2 should have an abundance of 4 also fixes your issue. For example, the following colormap file works for me with your original inputs:

GL  NA
seq2    #D35000:2,#E72802:1,#FACE1B:1
seq3    #D35000
seq4    #E72802
seq5    #E72802
seq6    #FACE1B
seq7    #FACE1B
seq8    #FACE1B
seq9    #FACE1B

This gives the output that should be expected, I think:

testout inference 1

Sorry for the roundabout path to a solution, let me know if I can help any further!

victor-ramos commented 9 months ago

Hey Wiliam, Sorry for the late reply.

Sorry for the problem you had to replicate the results.

Thanks for quickly finding the problem!

However, I would recommend that whenever possible you eliminate ambiguities from your input sequences, either by replacing them with a gap character or resolving them based on other sequences, if there is only one base observed at that site. In this case, every non-ambiguous sequence has a T at the relevant site, so the sequences containing N's should really just have that site changed to a T (Any parsimony-based disambiguation would do this, anyway).

I see, we are going to pay attention to that and temporarily implement this solution to our pipeline.

the nodes will be colored incorrectly, since the colors are stored as keys in a dictionary, and the key #D35000 will have its value set to 1 twice. Got it, thanks !

Sorry for the roundabout path to a solution, let me know if I can help any further! That doesn't look like a tricky solution, we can deal with that in the mean time !

Thanks a lot for checking the GCTree behavior!

Victor