Closed Arkadiy-Garber closed 1 year ago
Any insight on this would be appreciated, as I have gone through 6 or 7 of these sorts of software and have had no luck in getting useful info. I am fairly new to ONP/methylation, so would appreciate an explanation in more layman terms. Thanks, Arkadiy
Hi please use the latest pre-release version https://github.com/hasindu2008/f5c/releases/tag/v1.2-beta then specify --pore r10
Hi Hasindu,
This seems to have worked, thanks so much for your help!!
Cheers, Arkadiy
Hi Hasindu,
Thanks again for your help on this. I was wondering whether you might help interpret the output files from this program.
I am attaching the output of the script calculate_methylation_frequency.py
, which was run on the original output from the f5c_x86_64_linux call-methylation
command.
For example, what is the difference between num_motifs_in_group and called_sites columns? What does the group_sequence represent? What do the start and end columns represent, and why are they sometimes the same base position, and sometimes a range?
Thanks for all your help! Arkadiy
Hi
Could you also please run f5c meth-freq on that output file (which does the same thing as calculate_methylation_frequency.py, but I have tested that option for f5c output).
Gotcha, here is the file from that run:
OK that result looks reasonable. perhaps that calculate_methylation_frequency.py does something weird when k-mers sizes are large which is the case for R10.
See if the description here helps on understanding those columns? https://hasindu2008.github.io/f5c/docs/commands#meth-freq It is not descriptive, if still hard to understand I will try to write up something to explain.
This issue in nanopolish also may help https://github.com/jts/nanopolish/issues/365.
Thanks Hasindu, that is indeed helpful. Here is a sample line from the output: when I use the -s flag:
['contig_1', '5455267', '5455267', '1', '33', '3', '0.091', 'split-group']
All lines have a start and end that's on the same base now. The way I interpret this is that there is 1 site that can be methylated, 33 reads mapping to that location, and 3 of those reads have methylation. Is that correct?
As an FYI, some lines still contain a sequence instead of "split-group":
['contig_1', '5455222', '5455222', '1', '3', '1', '0.333', 'TCAAGAGCCGAGGCACA']
Thanks! Arkadiy
So now, column 6 divided by column 5 should equal the methylation frequency in column 7. Yeh, that split-group comes when there were multiple CGs in the group and thus had to be split, which you do not have to worry about.
I am curious, what kind of genome is your data?
Got it, thanks!
This data comes from a bacterial genome. Not 100% sure of the phylogeny, but I think it is somewhere in the Alphaproteobacteria class.
Ah right, if you happen to have bisulphite data, you can do a correlation test. The model was tested on HG2 and gave around ~0.9 correlation [see https://hasindu2008.github.io/f5c/docs/r10train last step 11]. on HG1 also got around 0.9 correlation with bisulphite.
Will close the issue for now, feel free to reopen if you face more issues.
Thanks, Hasindu. This has been immensely helpful so far. I don't think bisulphite data exists for this dataset, but I will mention it to the lab that generated this data. What is HG1 and HG2?
HG001 (NA12878) and HG002 (NA24385) human reference genome samples.
Got it, thanks for that clarification. Do you think that this means the results for a bacterial genome are invalid? I am seeing a lot of methylation calls - a lot more than I would have expected in a bacterium.
Is there a model for bacterial genomes that I can use for prediction of methylation?
Thanks for your help, Arkadiy
Hi, I think it would depend on the 9-mer composition in the bacteria genome. For hidden Markov models, as far as I am aware, we do not need separate models for different species, as long as all the k-mers are trained.
There were 4% of 9-mers that were not too frequent in the human data we used for training; thus, such k-mers were not trained. If such k-mers are abundant in the bacteria, methylation calls related to such k-mers would not be ideal. We could make the present model more complete if we train those currently missing k-mers.
How are you evaluating the methylation calls at the moment? Are you looking at the methylation frequencies - output from meth-freq?
@Arkadiy-Garber Have you solved this issue? I recently got hold of a bacterial dataset, called methylation and see the average methylation frequency to be almost 0, which is expected as that bacteria apparently does not have any CpG methylation. For the assembly.bar16methFreq.txt you once attached, the average methylation frequency is close to 0:
cut -f 7 assembly.bar16methFreq.txt | tail -n+2 | datamash mean 1 median 1
0.03537209499969 0
@hasindu2008 thanks for following up! No, this issue has not yet been resolved. Yes, am looking at the output from meth-freq.
Glad you had a chance to get a hold of a bacterial dataset. Do your results suggest that false positives should be rare?
If you are interested, I'd be happy to send you my dataset. I think that might be of benefit to both of us, since you clearly know what you are doing, and I'm still figuring these things out. Let me know, happy to send something via email or Dropbox.
Thanks again for all your help! Arkadiy
Yes sure. If you could upload some data in fast5 or blow5 format, I can have a look.
@Arkadiy-Garber Let me know if you want me to have a look on this.
Hi @hasindu2008, yes! That would be great - sorry for the radio silence. Hectic schedule lately, but I would greatly appreciate. What would be a good way to get you the data (it will be on the order of multiple Gb).
Also, would you like to raw multi-FAST5s derived from the R10 gridion run, or the single-FAST5s that I re-generated using multi_to_single_fast5?
My 2 pipelines are attached. I got different results from these pipelines, and I don't know why...
@Arkadiy-Garber
Do you have the file assembly.bar12meth.tsv
that you generated?
Otherwise, Raw multi-FAST5 from R10 gridion along with the basecalls (combined.fastq), assembly.sorted.bam and assembly.fasta in your pipeline_1.txt would be helpful.
There are two reasons I can think of why the results from the two pipelines are different
~/bin/nanopolish/scripts/calculate_methylation_frequency.py assembly.bar12meth.tsv > assembly.bar12methFreq.tsv
in your pipelines. calculate_methylation_frequency.py
does not work for R10 models as intended. You should be using f5c meth-freq -i assembly.bar12meth.tsv -s > assembly.bar12methFreq.tsv
instead. This could explain why you are seeing too high methylation frequencies for bacteria.Thanks, Hasindu. those files (raw multi-FAST5 from R10 gridion along with the basecalls (combined.fastq), assembly.sorted.bam and assembly.fasta in your pipeline_1.txt) should be in the Box folder that I shared with you on May 7th.
I'll share the assembly.bar12meth.tsv files shortly, thanks!
I ran the following commands on the combined.fastq file you already provided. I converted your raw multi-FAST5 to BLOW5 format for my convenience, but that does not affect the result.
f5c index combined.fastq --slow5 merged.blow5
minimap2 -ax map-ont assembly.fasta combined.fastq --secondary=no -t20 | samtools sort - -o combined_fastq.bam && samtools index combined_fastq.bam
f5c call-methylation --slow5 merged.blow5 -r combined.fastq -b combined_fastq.bam -g assembly.fasta -t 16 --pore r10 > combined_fastq_meth.tsv
f5c meth-freq -i combined_fastq_meth.tsv -s -o combined_fastq_meth_freq.tsv
cut -f 7 combined_fastq_meth_freq.tsv | tail -n+2 | datamash mean 1 median 1 sstdev 1
0.025360427328806 0 0.055755034577387
The mean, median and sstdev of the methylation frequency is 0.025360427328806, 0, 0.055755034577387 respectively which is pretty low. So what exactly is the problem - you mentioned that you are getting too high methylation frequencies?
Hi Hasindu,
Thank you for running that, and I appreciate you sharing the results.
I guess what I am confused about is how to interpret those numbers, as well as the output of the meth-freq files. That methylation frequency (0.025) does indeed sound low, but aren't there many methylation calls in the meth-freq (unless I am not interpreting them correctly). Based on your expertise, would you say this data looks valid, especially since the model used for this bacterial analysis is based on human genome data.
Thanks again for all your help, and sorry if I keep bugging with the same/similar questions. This is all still pretty confusing to me.
Cheers, Arkadiy
Also, I see you went with what sent as pipeline_2. Some of the command parameters are different than mine, and I'd be curious to hear if there are any considerations that you took, which I didn't, that might also impact the output.
As far as getting a profile of methylation patters (e.g. which genes are methylated vs which are not) and comparing across samples, are there any other considerations that you would suggest taking?
Thanks! Arkadiy
meth_freq.tsv contains a row for each CpG site it encountered in the genomic region - irrespective of whether methylated or unmethylated. Each row tells how many reads there were covering the CpG site and how many out of them were methylated, and based on that the methylation frequency for that CpG site. Yeh, this data look valid (near 0 average methylation frequency is what is expected for bacterias).
The commands:
Awesome, thanks for your feedback on this! Your help is much appreciated!!
Closing the issue. Feel free to reopen if you have additional questions.
I was having this issue with nanopolish and was pointed to the fact that it does not support r10 flow cells. I then tried f5c. The output here is a little more verbose, but I am still ending up with the same error: