giesselmann / STRique

Nanopore raw signal repeat detection pipeline
MIT License
45 stars 10 forks source link

will Error parsing alignment affect results? #7

Closed liutiming closed 4 years ago

liutiming commented 5 years ago

Hi thanks again for the tool. Error parsing alignment appears and followed by unaligned reads as following:

936e3a5e-548e-482f-ae61-47e9ba16c2b0 4 * 0 0 ** 0 0 ......

Will such errors affect the results? I have this question because the estimation of the repeat number is much smaller than the real value: although the real value is only around 40 repeats, result shows 4 for all fast5 files. This is only the case when I set the flanking region in the config file to be 50bp. When I set flanking region to be 250bp as recommended in another issue, all results show 0. The reads are generated by sequencing after PCR.

giesselmann commented 5 years ago

Hey, not sure if I understand the question correctly. The mentioned error occurs, if reads are unmapped. These can currently not be analyzed by STRique. The ouput shown is the line from the SAM file which could not be parsed. The '4' in column two is the SAM-flag for unampped reads, not the repeat count. I would recommend to only pipe mapped reads into STRique:

samtools view -F 2308 some_reads.bam | python3 scripts/STRique.py count ...

If you inspect your alignment in a genome browser, do you get mapped reads on target? If the repeat is longer than the one in the reference, and the PCR primers are close to it, I could imagine only few get successfully mapped.

liutiming commented 5 years ago

Hi thanks for the reply! I understand that the 4 in the line I mentioned just now is a flag that states the read is unmapped. The 4 and 0 I was referring to in my first post is as following, as I extracted from the output file: when flanking region is 50bp long: 83d02ea2-a5db-4b40-a74b-c2d99df57b3b target - 4 3.942093445824795 3.942093445824795 -2097.6609384208264 1969 0 - when flanking region is 250bp long: 83d02ea2-a5db-4b40-a74b-c2d99df57b3b target - 0 3.3857357183359733 3.3857357183359733 0 4377 0 -

I am able to get some approximate repeat counts with RepeatHMM and I am able to extract reads from the sam file at that region so I think the mapping is reasonably ok. I am wondering why will flanking length cause a change? Does the length of the whole read must be larger than length of repeat + length of the flanking region?

giesselmann commented 5 years ago

Okay, last question first: Yes, the read should contain the entire flanking sequences, otherwise the alignment and detection of the target in the raw signal will fail. If you used a PCR I would suggest to take the maximal sequence between PCR primer to repeat. The first line looks like a reasonable output for me, 3.9 scores for the prefix/suffix alignment is good. There's a tool hdf5view to directly open the fast5 files from ONT, can you manually check this read? A repeat count of 40 is definitely visible in the raw signal.

liutiming commented 5 years ago

Great thanks! I will check. Actually I looked at some fast5 files before using some hdf tools and realised that the row signal is stored as numbers. Can I ask how I can interpret where is the repeat from there? Is there a tool to visualise the numbers as well? Thanks!!

giesselmann commented 5 years ago

Couldn't find a download for hdf5view, but this one is doing the job: https://support.hdfgroup.org/projects/compass/download.html In the STRique repo under 'data' you find an example file. If you open it, navigate to the 'Signal' Dataset, mark the column and in the top-right corner you find a plot button. The expansion is usally a very monotonic signal as in the attached image: image

liutiming commented 5 years ago

Great thank you very much! I will try now.

liutiming commented 5 years ago

I think it will take a while to build the environment so I will get back to you tomorrow. Thanks so much for the prompt reply!

liutiming commented 5 years ago

image The fast5 raw signal looks like this... can I ask what is your opinion on this? Do you think the estimation of 4 repeat number is reasonable?

Also I realised that the hdf compass only worked on a windows machine... the dependencies of the linux version clash with each other (python 2.7 vs hydroffice.bag (which requires python 3))

giesselmann commented 5 years ago

That's definitely longer than 4 repeats. Can you maybe first share the respective line from you STRique config file. And if anyhow possible, would you be able to share a single raw fast5 of one of these reads, the config and if not standard mouse/human the reference genome? You can use ONTs API (https://pypi.org/project/ont-fast5-api/) to convert a bulk-fast5 to single-fast5 and send only one. If you don't want to upload it here, you could send it to giesselmann[at]molgen.mpg.de

liutiming commented 5 years ago

Alright sent. I suspect that the issue may be due to flanking length between primer and repeat being shorter than the flanking length specified. I am not sure about this but hope it can provide some clues.