giesselmann / STRique

Nanopore raw signal repeat detection pipeline
MIT License

Issues related to native DNA #22

Closed s-t-calus closed 2 years ago

s-t-calus commented 4 years ago

Hello Pay,

Just recently our group managed to sequence multiple plasmids containing 50x STRs made of tri-nucleotides. Despite initial problems with the 'config file', we managed to analyse our dataset with STRique running in Docker. The results look quite good: the overall output showed an acceptable range of deviation when visualized as a whisker plot, and the data looked very good after alignment and visualization in IGV. However, the same data plotted as a bar chart does not look as good as we initially thought. Question 1: is that something you would expect, or did we make a mistake during the analysis? The very large amount of data generated for the plasmid samples will allow us to pre- and post-filter the data, e.g. by removing extreme outliers or filtering on prefix and suffix scores.

Nonetheless, the newest dataset, generated from native DNA, seems to fail completely when processed with STRique, i.e. zero reads in the final output. Despite a substantial number of reads (>400k, Cas9-enriched) we cannot produce any significant output with STRique. Question 2: what would you suggest to troubleshoot this? We could shorten both prefix and suffix from 150 bp down to 20-30 nt; however, judging from the alignment results (SAM, minimap2), this will definitely fail again, as >95% of the data is missing the 5' and 3' flanking regions and our gene of interest is heavily truncated in >99% of reads. I know that STRique can identify methylation patterns on gDNA; we have not tried that yet, but the reads in FASTA format seem to have an extreme number of errors. Question 3: do you think we may be sequencing highly or extremely modified gDNA that cannot be accurately basecalled or processed by the STRique algorithm? Have you observed something similar, or heard of such issues from other groups?

Kind regards, Simon

giesselmann commented 4 years ago

Hi Simon, To your 1st question: I guess a bar plot is just not suitable to illustrate data with an underlying (heterogeneous) distribution.
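As a rough illustration of the filtering Simon mentions and of looking at the counts as a distribution, here is a minimal Python sketch. It assumes the STRique output is a tab-separated table with `count`, `score_prefix` and `score_suffix` columns; the score thresholds (`4.0`) and the Tukey-fence factor are placeholder assumptions for illustration, not recommended values.

```python
import csv
import io
import statistics
from collections import Counter


def load_counts(tsv_text, min_prefix=4.0, min_suffix=4.0):
    """Return repeat counts for reads whose flank scores pass both thresholds.

    Column names and default thresholds are assumptions for this sketch.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = []
    for row in reader:
        prefix_ok = float(row["score_prefix"]) >= min_prefix
        suffix_ok = float(row["score_suffix"]) >= min_suffix
        if prefix_ok and suffix_ok:
            counts.append(int(row["count"]))
    return counts


def drop_outliers(counts, k=1.5):
    """Remove values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(counts) < 4:
        return list(counts)  # too few values to estimate quartiles
    q1, _, q3 = statistics.quantiles(counts, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [c for c in counts if low <= c <= high]


def histogram(counts):
    """Map each repeat count to the number of reads supporting it."""
    return dict(sorted(Counter(counts).items()))
```

Plotting the result of `histogram(...)` (reads per repeat count) shows the spread of the counts directly, which a single bar per sample hides.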

To your 2nd question: Is 400k the number of reads on target, or the total from the flow cell? Can you check in the .bam file whether reads are mapping to your target locus? I don't understand why you would shorten prefix and suffix; 150 bp is certainly needed to map the beginning and end of the repeat in the raw signal. How far away are your Cas9 guides from the target?
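One way to run the on-target check is sketched below in plain Python, working on SAM text (e.g. minimap2 output, or a .bam converted to SAM). It counts mapped primary reads, reads overlapping the repeat, and reads that also cover 150 bp of flank on both sides. The function names, the chromosome name and the repeat coordinates are hypothetical placeholders; only the 150 bp flank comes from the discussion above.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_end(pos, cigar):
    """Return the 1-based inclusive reference end of an alignment."""
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def classify_reads(sam_lines, chrom, repeat_start, repeat_end, flank=150):
    """Count primary alignments that overlap and that fully span the repeat.

    Coordinates are 1-based and inclusive, as in the SAM POS field.
    """
    stats = {"mapped_primary": 0, "on_target": 0, "spanning": 0}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue  # unmapped, secondary or supplementary
        stats["mapped_primary"] += 1
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        if rname != chrom or cigar == "*":
            continue
        end = reference_end(pos, cigar)
        if pos <= repeat_end and end >= repeat_start:
            stats["on_target"] += 1
            if pos <= repeat_start - flank and end >= repeat_end + flank:
                stats["spanning"] += 1
    return stats
```

If `spanning` is close to zero while `on_target` is not, the reads are reaching the locus but ending before the flanks, which would be consistent with the truncation Simon describes.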

To your 3rd question: STRique doesn't do basecalling, and modifications affect the signal so little that they will not affect the repeat counting. The basecalling error rate you see in the FASTA is expected and is discussed in the online methods/supplement of our paper.

Pay