How to best leverage variants information?

touala commented 1 year ago

Hello @maickrau,

I'm interested in deconvolving the main rDNA variants for the cell line we are working with. I've used ONT duplex data to build the DBG and the resulting graph looks fine. I obtained a fairly large set of potential variants (n=~1000). Is there guidelines that I can follow to best use those information for resolving at least some of the variability inter-rDNA?

Best,

Alan

maickrau commented 1 year ago

Hi, that depends on what exactly you want to do with the variants. If you want all phased variants per individual rDNA copy, ribotin doesn't do that and I don't think it will in the foreseeable future. Some limited phasing, like for example full length 18s genes, should be possible although not implemented currently. With duplex it might be possible to extract some full length phased rDNA copies but that's not implemented yet. If you want to remove possible sequencing error variants, you can filter them based on coverage.

touala commented 1 year ago

Thank you for the quick turnaround.

We have fairly long simplex and duplex reads, many that should cover 90% of the rDNA unit. I'm mainly interested in finding sequences for the (main) individual rDNA units as defined in "rDNA assembly" section of T2T paper. Is it what you meant by "phased variants per individual rDNA copy"? Otherwise phasing of consecutive rDNA units is not a priority and I doubt the datasets we have is sufficient for that.

Alan

maickrau commented 1 year ago

I see. The old scripts used for doing that in the T2T project are in a different repo: https://github.com/maickrau/rdna_resolution

The procedure basically involved using GraphAligner to correct the long ONT reads using the graph, then extracting the rDNA sequences from the corrected ONT reads, and clustering the extracted rDNA sequences based on sequence similarity. The commands for this are in https://github.com/maickrau/rdna_resolution/blob/master/resolution/script.sh lines 2006 and 2052-2144 but the documentation is very poor and you would have to adjust the commands since your ONT reads are probably not split by chromosome.

touala commented 1 year ago

Awesome, thanks for sharing. I'll need some time to parse through the main and secondary scripts to see how/what I can reuse. Do you think this is doable without having the reads split by chromosome?

maickrau commented 1 year ago

Hard to say. At least for CHM13 it did recover some of the different rDNAs per chromosome so it's possible it might successfully separate them anyways even if the reads aren't split, but it depends on the actual sequences of the rDNAs. Any small variants (SNPs) will not be enough to separate the rDNA units but larger variation might.

maickrau commented 1 year ago

The newest commit of ribotin now clusters the rDNA morphs based on ultralong ONT alignments. For ribotin-verkko use the parameter --do-ul, and for ribotin-ref use the parameter --nano /path/to/nanopore/reads.fa. This requires GraphAligner to be installed.

touala commented 1 year ago

Hi @maickrau,

Thank you for providing this new feature. I've gave it a quick try but get the following error:

$ ribotin-ref -r ./references/KY962518.1.fasta -o ribotin/ribotin_duplex_with_morphs --mbg ./softwares/MBG/bin/MBG -i run1.duplex_orig_long.fastq -i run2.duplex_orig_long.fastq --nano run1.simplex.fastq --nano run2.simplex.fastq
ribotin-ref version Branch master commit d0a229d72ab10ebe3bf66e54af72bd2651b8648a 2023-04-21 10:59:44 +0300
checking for GraphAligner
/users/user/.conda/envs/graphaligner_v1.0.17b/bin/GraphAligner
output folder: ribotin/ribotin_duplex_with_morphs
extracting HiFi/duplex reads
extracting ultralong ONT reads
running
running MBG
MBG command:
 -o ribotin/ribotin_duplex_with_morphs/graph.gfa -i ribotin/ribotin_duplex_with_morphs/hifi_reads.fa -k 101 -w 71 -a 2 -u 3 -r 15000 -R 4000 --error-masking=msat --output-sequence-paths ribotin/ribotin_duplex_with_morphs/paths.gaf --only-local-resolve 1> ribotin/ribotin_duplex_with_morphs/mbg_stdout.txt 2> ribotin/ribotin_duplex_with_morphs/mbg_stderr.txt
reading graph
getting consensus
ribotin-ref: src/ClusterHandler.cpp:347: Path getHeavyPath(const GfaGraph&): Assertion `predecessor.count(result.nodes.back()) == 1' failed.
Abort (core dumped)

Any idea? Does it rely on connections between rDNA units?

Alan

maickrau commented 1 year ago

Usually this happens if it failed to build the hifi graph for some reason. Could you check if the hifi read file is empty with ls -lht ribotin/ribotin_duplex_with_morphs/hifi_reads.fa? If it's not empty, could you upload the folder somewhere?

touala commented 1 year ago

This what I have. hifi_reads.fa seems properly formatted. Please note that mbg_stderr.txt contains the following text sh: 1: -o: not found. The MBG command in my previous message seems wrong and would explain this stderr. I'm trying to rerun it now...

ls -hla ribotin/ribotin_duplex_with_morphs/
total 3,0G
drwxrwx--- 2 user group  111 avril 21 12:17 .
drwxrwx--- 9 user group 4,0K avril 21 12:14 ..
-rw-rw---- 1 user group 188M avril 21 13:56 hifi_reads.fa
-rw-rw---- 1 user group   21 avril 21 13:59 mbg_stderr.txt
-rw-rw---- 1 user group    0 avril 21 13:31 mbg_stdout.txt
-rw-rw---- 1 user group 2,8G avril 21 13:59 ont_reads.fa

maickrau commented 1 year ago

This was a bug in parameter parsing, it failed to find MBG if it's given as a parameter. You can rerun with the latest commit and same command now.

touala commented 1 year ago

MBG command is running. Thanks for the fast turnaround, I will keep you updated.

maickrau / ribotin

How to best leverage variants information? #3