Closed victor-ramos closed 9 months ago
Hi Victor, thanks for reaching out about this!
Can you send the commands used to run gctree
, including the deduplicate
and mkconfig
steps?
Also, I understand that the file clone_id_3.inference.1.fasta
in the first screenshot is the output fasta containing the sequences in one of the output trees, is that correct? Is one of the trees in your screenshots the corresponding output tree? If not, could you also send the image of that corresponding tree?
Dear Wiliam, thanks for taking a look into that.
Can you send the commands used to run
gctree
, including thededuplicate
andmkconfig
steps?
Sure, commands below:
deduplicate clone_id_3_ali.fasta --root GL --abundance_file clone_id_3_abundances.csv --idmapfile clone_id_3_idmap.txt > clone_id_3_dedup.phylip
mkconfig clone_id_3_dedup.phylip dnapars > dnapars.cfg
dnapars < dnapars.cfg > {log}
gctree infer --root GL --idlabel --summarize_forest --colormapfile clone_id_3_seqnames_colorfile.tsv --idmapfile clone_id_3_idmap.txt --verbose --outbase {params.out_dir} outfile clone_id_3_abundances.csv > {log} 2>&1
I also uploaded all input/output files to Dropbox from the commands listed above.
Also, I understand that the file
clone_id_3.inference.1.fasta
in the first screenshot is the output fasta containing the sequences in one of the output trees, is that correct?
Yes, the file clone_id_3.inference.1.fasta
contains the sequences in the only tree inferred by GCTree for clone_id_3.
Is one of the trees in your screenshots the corresponding output tree? If not, could you also send the image of that corresponding tree?
The second screenshot was just an example of another clone containing accurately-colored identical sequences.
I hope I understood your questions, let me know if anything is not clear or if you need anything else. Thanks.
I had a bit of trouble recreating your issue.
After running the commands you sent, the gctree infer
command gives the following error:
Traceback (most recent call last):
File "/Users/willdumm/Downloads/hdagenv/bin/gctree", line 8, in <module>
sys.exit(main())
File "/Users/willdumm/Downloads/hdagenv/lib/python3.9/site-packages/gctree/cli.py", line 739, in main
args.func(args)
File "/Users/willdumm/Downloads/hdagenv/lib/python3.9/site-packages/gctree/cli.py", line 239, in infer
seqid, color = line.rstrip().split("\t")
ValueError: too many values to unpack (expected 2)
This error comes from the format of the file clone_id_3_seqnames_colorfile.tsv
, which has three columns instead of two.
GL NA
seq1 &&P4_3_3_5_1160-2mRG&P4_3_3_5_1160-2mRK& #D35000
seq2 &&P4_3_3_5_1171-2mRG&P4_3_3_5_1171-2mRK& #D35000
I had to remind myself what the format of this file was supposed to be. Here's the relevant bit of the output from gctree infer --help
:
--colormapfile COLORMAPFILE
File containing color map in the tab-separated format: ``"SeqID color"`` (default: None)
After modifying the colormap file to this format, I was able to get output files that show your issue.
I'm not sure I understand the issue with coloring of nodes, but I do see that seq2
is being assigned an abundance of 4 in the output files, while it should only have abundance 3. Also, the leaf node seq1
is missing. Do you agree with this description of the issue? I'm working on figuring this out, and I'll reply again when I do. In the meantime, the corrected color map file is attached (note I changed the extension to csv only so I could upload to GitHub)
I've figured out what's going on here. Seq1 has an ambiguous character (an N 23 bases before the end). As an experimental feature, gctree will disambiguate this site in its tree context using parsimony, and in this case that N is changed to a T, which makes the sequence identical to Seq2.
Therefore, it is correct behavior for the output tree to show abundance of 4 instead of 3, because the observed sequence which is named Seq1 is included in that node, too. As you pointed out, color mapping via the command line doesn't work correctly, because it uses a mapping that doesn't account for this collapse of Seq1 into the Seq2 node. This is a bug that should be fixed.
However, I would recommend that whenever possible you eliminate ambiguities from your input sequences, either by replacing them with a gap character or resolving them based on other sequences, if there is only one base observed at that site. In this case, every non-ambiguous sequence has a T at the relevant site, so the sequences containing N's should really just have that site changed to a T (Any parsimony-based disambiguation would do this, anyway).
Making this change to your input alignment, gctree seems to work as expected, and should color the node correctly.
Let me know if this resolves your issue!
I just confirmed that for me, the issue is fixed by eliminating N
's as described in my last reply, and providing a correctly formatted colormap file to gctree infer. For example,
GL NA
seq1 #D35000:2,#E72802:1,#FACE1B:1
seq2 #D35000
seq3 #E72802
seq4 #E72802
seq5 #FACE1B
seq6 #FACE1B
seq7 #FACE1B
seq8 #FACE1B
Note that if you do something like
GL NA
seq1 #D35000:1,#D35000:1,#E72802:1,#FACE1B:1
seq2 #D35000
seq3 #E72802
seq4 #E72802
seq5 #FACE1B
seq6 #FACE1B
seq7 #FACE1B
seq8 #FACE1B
the nodes will be colored incorrectly, since the colors are stored as keys in a dictionary, and the key #D35000
will have its value set to 1 twice.
In fact, using your original input files without replacing N
's and using a correctly formatted colormap file that accounts for the fact that Seq2
should have an abundance of 4 also fixes your issue. For example, the following colormap file works for me with your original inputs:
GL NA
seq2 #D35000:2,#E72802:1,#FACE1B:1
seq3 #D35000
seq4 #E72802
seq5 #E72802
seq6 #FACE1B
seq7 #FACE1B
seq8 #FACE1B
seq9 #FACE1B
This gives the output that should be expected, I think:
Sorry for the roundabout path to a solution, let me know if I can help any further!
Hey Wiliam, Sorry for the late reply.
Sorry for the problem you had to replicate the results.
Thanks for quickly finding the problem!
However, I would recommend that whenever possible you eliminate ambiguities from your input sequences, either by replacing them with a gap character or resolving them based on other sequences, if there is only one base observed at that site. In this case, every non-ambiguous sequence has a T at the relevant site, so the sequences containing N's should really just have that site changed to a T (Any parsimony-based disambiguation would do this, anyway).
I see, we are going to pay attention to that and temporarily implement this solution to our pipeline.
the nodes will be colored incorrectly, since the colors are stored as keys in a dictionary, and the key #D35000 will have its value set to 1 twice. Got it, thanks !
Sorry for the roundabout path to a solution, let me know if I can help any further! That doesn't look like a tricky solution, we can deal with that in the mean time !
Thanks a lot for checking the GCTree behavior!
Victor
As illustrated in the picture below, the generated tree includes a node (seq2) with three identical sequences. While the deduplication output (abundances.csv) accurately tallies the sequences, the inference.fasta file contains an extra sequence for seq2 (abundance=4). Consequently, the tree anticipates four sequences, but we actually have three, leading to a discrepancy in how the node is colored compared to the information in our file.
This happened for a couple of clones only, in general when we have identical sequences, the nodes are colored accurately.
GCTree version 4.1.2 was used to generate the trees. Files necessary to reproduce the results, on Dropbox