Trimming question + analysis of specific regions

nf-core / methylseq

Methylation (Bisulfite-Sequencing) analysis pipeline using Bismark or bwa-meth + MethylDackel

https://nf-co.re/methylseq

MIT License


Trimming question + analysis of specific regions #87

Open ewels opened 5 years ago

ewels commented 5 years ago

Hi all, @FelixKrueger @ewels

I tried to run the pipeline on some data, but I have no idea how they were produced. I only know that they are methylation data. When FastQC ran, I saw that the reads have bad quality at the end. I wonder if Trim Galore should have an option in this pipeline to cut bases by quality (5' and 3' trimming) instead of just raw bases.

Also, I will have some methylation data for 3 specific genes in the next few days. I looked into Bismark, but I didn't see any option to narrow down the search window for methylation given, e.g., a BED file. Searching the internet, I found this approach:

https://github.com/ENCODE-DCC/dna-me-pipeline

Any thoughts?

Originally posted by @kokyriakidis in https://github.com/nf-core/methylseq/issues/85#issuecomment-476149257

ewels commented 5 years ago

Hi @kokyriakidis,

TrimGalore! already trims low-quality bases, I think. Have a look at the FastQC reports from after the trimming to check the read qualities (these FastQC results are not shown in the MultiQC report; you need to look in results/trim_galore/fastqc).

FelixKrueger commented 5 years ago

Trim Galore removes poor qualities from the 3' end of sequence, but does not do this from the 5' end (as it is never really needed). If you have some special kind of BS-seq data I'd be happy to look at your FastQC report (the HTML file), to see if I have any recommendations.
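To make the 3' quality trimming described above more concrete: Trim Galore delegates quality trimming to Cutadapt, whose documentation describes a BWA-style running-sum approach. The following is a minimal sketch of that idea, not Trim Galore's or Cutadapt's actual code. It assumes Phred+33 encoded qualities, and the function name `quality_trim_3prime` is made up for illustration. The default cutoff of 20 mirrors Trim Galore's documented default.

```python
def quality_trim_3prime(seq, qual, cutoff=20, base=33):
    """Trim low-quality bases from the 3' end only (hypothetical helper).

    Walks from the 3' end towards the 5' end, accumulating
    (cutoff - quality). The read is cut at the position where this
    running sum is largest; the walk stops once the sum drops below zero.
    Assumes Phred+33 quality encoding.
    """
    running = 0
    best = 0
    cut = len(seq)  # default: keep the whole read
    for i in range(len(qual) - 1, -1, -1):
        running += cutoff - (ord(qual[i]) - base)
        if running < 0:
            break
        if running > best:
            best = running
            cut = i
    return seq[:cut], qual[:cut]


# Toy read: good qualities at the 5' end, poor qualities at the 3' end.
example_scores = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
example_qual = "".join(chr(q + 33) for q in example_scores)
example_seq = "ACGTACGTAC"

trimmed_seq, trimmed_qual = quality_trim_3prime(example_seq, example_qual, cutoff=10)
print(trimmed_seq)  # ACGT
```

Note that nothing is removed from the 5' end, which matches the behaviour described in the comment above.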

On a related note, a close colleague of mine has recently written a tool to analyse raw sequencing files and try to figure out which kind of bisulfite sequencing experiment had been performed. The tool (Charades) is still kind of in alpha mode, but you are very welcome to give it a try! https://github.com/ChristelKrueger/Charades

FelixKrueger commented 5 years ago

Regarding your second question: I haven't looked at the ENCODE-DCC pipeline myself, but you could either

align the data genome-wide and just look at your genes of interest, or
get the sequence of the genes of interest (and maybe some surrounding sequence) and use that as a custom genome

The latter approach will be a lot quicker, but it might not be as accurate for repetitive regions within your genes of interest, and you will have to adjust the positions if you want to bring them back in line with other genome coordinates.
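Both options come down to simple coordinate bookkeeping, which could be sketched roughly as follows. This is an illustration under stated assumptions, not part of Bismark or the pipeline: the `Region` type and the helper names are hypothetical, regions are taken to be BED-style (0-based start, exclusive end), methylation calls are taken to be 1-based positions in tab-separated lines with the chromosome in the first column and the position in the second (as in a Bismark coverage file), and the example coordinates are invented.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    """A BED-style interval (hypothetical helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def in_regions(chrom: str, pos: int, regions: Iterable[Region]) -> bool:
    """Option 1: is a 1-based genome position inside any region of interest?"""
    return any(r.chrom == chrom and r.start < pos <= r.end for r in regions)


def filter_calls(lines: Iterable[str], regions: List[Region]) -> List[str]:
    """Keep tab-separated call lines (chrom, 1-based position, ...) in the regions."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if in_regions(fields[0], int(fields[1]), regions):
            kept.append(line)
    return kept


def to_genome_position(region: Region, local_pos: int) -> Tuple[str, int]:
    """Option 2: map a 1-based position on the extracted sequence
    (the custom genome) back to a 1-based genome coordinate."""
    if not 1 <= local_pos <= region.end - region.start:
        raise ValueError("position lies outside the extracted region")
    return region.chrom, region.start + local_pos


# Invented coordinates, for illustration only.
gene = Region("chr1", 1000, 2000)
calls = ["chr1\t1001\t1001\t80.0\t8\t2", "chr1\t5000\t5000\t10.0\t1\t9"]
print(filter_calls(calls, [gene]))
print(to_genome_position(gene, 1))
```

If flanking sequence is added around a gene when building the custom genome, the region start has to describe the start of the whole extracted sequence, flanks included, for the shift to be correct.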

kokyriakidis commented 5 years ago

@FelixKrueger Thank you very much! You can check the fastq at these links:

These are the initial files:
https://wetransfer.com/downloads/1b780edcfca5363f09e74202cca1ebd020190325114013/08a8a71624bbad7ecf6db365789c066520190325114013/affc1f

These are after TrimGalore with default settings:
https://wetransfer.com/downloads/a88486a3a7c15ac09042a291f60673b220190325114819/e96bc14c81defea887f22aa21369c0e820190325114819/c7af3f

FelixKrueger commented 5 years ago

Whoah, that looks quite wild indeed. Judging by the sequence composition, it doesn't look like a standard sequencing protocol; it rather looks like amplicon sequencing to me. Any chance this is targeting the PTPN22 gene, and nothing else? I would probably proceed with a standard trim_galore --paired file1 file2 command, followed by a default directional alignment, and see what you get. Cheers, Felix
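The two steps suggested above could be scripted along these lines. This is only a sketch that assembles the command lines without running them. Several things are assumptions rather than something stated in this thread: the helper names are hypothetical, the trimmed file naming (`_val_1` / `_val_2`) follows my understanding of Trim Galore's paired-end output convention, the file and directory names are placeholders, and the genome directory is assumed to have been prepared with `bismark_genome_preparation` beforehand. Bismark runs in directional mode unless told otherwise, so no extra alignment option is added.

```python
import shlex
from pathlib import Path


def trimmed_name(fastq: str, mate: int) -> str:
    """Guess the Trim Galore paired-end output name (ASSUMED convention)."""
    name = Path(fastq).name
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[:-3]
    for ext in (".fastq", ".fq"):
        if name.endswith(ext):
            name = name[: -len(ext)]
            break
    return f"{name}_val_{mate}.fq" + (".gz" if gzipped else "")


def build_commands(read1: str, read2: str, genome_dir: str):
    """Return the trimming and alignment commands as argument lists."""
    trim = ["trim_galore", "--paired", read1, read2]
    align = [
        "bismark", "--genome", genome_dir,
        "-1", trimmed_name(read1, 1),
        "-2", trimmed_name(read2, 2),
    ]
    return [trim, align]


# Placeholder file and directory names.
for cmd in build_commands("sample_R1.fastq.gz", "sample_R2.fastq.gz", "genome/"):
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to check the file names against what Trim Galore actually wrote before launching the alignment.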

kokyriakidis commented 5 years ago

It is targeting PTPN22 but also the lns-1 gene