maickrau / ribotin

MIT License
30 stars 2 forks source link

How to obtain a complete rDNA assembly based on the output of ribotin? #12

Open tiramisutes opened 2 months ago

tiramisutes commented 2 months ago

Dear, Which file output from ribotin can be used for the next step of analysis to obtain complete rDNA assembly, and how to do it? Thanks.

maickrau commented 2 months ago

Hi, for humans it's typically not possible to get a complete rDNA assembly. There is a file morphgraph.gfa in the output folder which shows how the morphs output by ribotin might be placed next to each others, but I haven't seen any case where it would have been possible to get the complete rDNA arrays from this information. You can try something similar to the CHM13 rDNA assembly where you take the most common morphs, estimate their copy counts and then fill in the array with a model sequence.

For non-humans it might be possible or not depending on how large and divergent the rDNA arrays are, but the same graph morphgraph.gfa would be a starting point.

jiadong324 commented 2 weeks ago

I am running ribotin on human and have the following questions.

  1. I got more than one morphgraph.gfa under subdirectory ribotin0, ribotin1 and etc. This is one of my results in morphgraph.gfa and it seems that I got 10 morphs. What's the difference of these morphgraph.gfa under different subdirectory?

    image
  2. For the example shown above, I am wondering how do you determine if a morph is common or not. In your paper, a morph is defined as the sequence of one complete repeat unit appear in the rDNA arrays once or multiple time. Would you please describe more details of interpreting the outputs?

Thanks! Looking forward to your reply!

maickrau commented 2 weeks ago

Hi,

  1. The different morphgraphs are for the different rDNA tangles in the verkko graph. The folders have files ribotin0/nodes.txt, ribotin1/nodes.txt etc which list the nodes in the verkko graph corresponding to the tangle. So the morphs in the different morphgraph files are located in different rDNA tangles. Different tangles very likely means different rDNA arrays, but the same tangle can and often has multiple rDNA arrays (eg as a made up example, one tangle might have the rDNA arrays of chr13 and chr14 and another tangle chr15, chr21, and chr22)
  2. Common would be the morphs with the highest copy counts. Ribotin doesn't estimate the copy count of the morphs but the coverage listed in the output somewhat corresponds to the copy count. It's not a straightforward division by genomic coverage since only reads which span the entire morph contribute to ribotin's coverage, but you can roughly try to estimate it by counting the coverage of reads longer than 50kbp, which will then be somewhat close to the average coverage per one copy. In that picture there's three morphs with high coverages, tangle2_morphconsensus0_coverage104, tangle2_morphconsensus1_coverage79 and tangle2_morphconsensus2_coverage59 which very likely appear multiple times in this rDNA array. The other morphs have noticably lower coverages which correspond to a low copy count, probably 1-3 copies depending on the morph.
jiadong324 commented 2 weeks ago

Thanks for your explanation!

A few things that I want to further confirm with you:

  1. You mentioned that a tangle might have different rDNA arrays. In my example, the same tangle have three high coverage morph (angle2_morphconsensus0_coverage104, tangle2_morphconsensus1_coverage79, tangle2_morphconsensus2_coverage59). Dose it mean these morphs might be originated from rDNA arrays on different chromosomes.
  2. For a tangle, you create the minimizer graph (i.e., graph.gfa) with HiFi reads for each nodes that saved in nodes.txt. The allele graph from each tange is merged for ONT alignment, while I found ont-alns.gaf for each tangle under ribotin subdirectory. In your paper, it seems that you merged the allele-graph for each tangle and then aligned the ONT reads.
  3. I am trying to understand the coverage for each morph. To count the reads that support each morph in my example, I tried to grep reads from readpaths-morphgraph.gaf, which seems to be created from ont-alns.gaf. I could found there are 75 ONT reads aligned to path tangle2_morphconsensus0_coverage104.
  4. For the variants.vcf, I thought this is the variant among different morphs, is that correct? The variants are called toward a sequence named as heavy_path. What is this heavy_path sequence?
maickrau commented 2 weeks ago

Hi,

  1. Possibly yes. It depends on if the same tangle has multiple rDNA arrays. You can try looking at the verkko graph and checking if it looks like there's multiple chromosomes entering the tangle
  2. The merged allele graph is in the temporary ONT alignment directory, by default ./tmp/ and then the per-tangle alignments are parsed out of the merged alignment file
  3. The path (column 6) can have the same morph appear multiple times in the same read (different parts of the same read to the different morph copies)
  4. The variant file is not morph-specific, instead it has variants relative to a tangle consensus sequence. The consensus is in consensus.fa but I would not recommend using the consensus or the variants, instead the morphs should be used.