nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com
Other
155 stars 20 forks source link

Strand-specific analysis of modified oligos #102

Closed darylgohl closed 1 year ago

darylgohl commented 1 year ago

We have generated some sequencing data from short (50 bp) oligonucleotides with strand-specific modifications and I was wondering if there is a way to use Remora to identify signatures of these modifications? Currently we have data for two different sets of samples: Sample A: A 50mer with a deaza modification on the forward strand and an unmodified reverse strand. Sample B: A 50mer with a deaza modification on the forward strand and a novel modification on the reverse strand.

I know that Remora is designed to take in canonical/unmodified and modified pod5 and bam files as inputs. We currently don't have a fully unmodified sample sequenced, but we could generate this data. My questions are: 1) Is there a way to use just a single strand (or a subset of reads corresponding to the reverse strand) as the input for the canonical sequence in Remora? Since these are oligos, we can determine which strand is which by the direction in which it was sequenced (the plus strand (containing the deaza base) is in the forward orientation and the minus strand is in the reverse complement orientation in the sequencing reads). We can see a clear effect of the deaza modification in the fastq data, where there is both an increased error rate and a decrease in Q-scores near the site of this modification for the plus strand). 2) I currently just have pod5 files and the fastq basecalls (not the mapped bam files). For each sample, there are pod5_pass and pod5_fail folders. Since it is possible that an uncharacterized DNA modification (we were attempting to sequence DNA that was modified with a large, >750 Da adduct) could have effects on read quality, would you recommend also including the pod5_fail files in this analysis, or restricting the analysis to the pod5_pass files?

Thanks for your help!

marcus1487 commented 1 year ago

I'll answer the second question first with a simple yes. I would definitely include fail pod5 for larger/ionically charged modified bases and short strands.

For the first question, it would be good to understand the goal of your analysis. Assuming you know the printed strand sequence, this should be sufficient to provide to minimap2 for mapping these reads. If you'd like to subset these reads you can simply filter the mapped BAM file to get forward or reverse strand mapping reads as desired. That being said many of the Remora commands can sort this strand/reference location out for you. For example the remora analyze plot ref_region command takes a bed file of reference regions which will be represented in the output. If you are looking to generate a Remora dataset the remora dataset prepare command takes a stranded BED file of positions at which to extract training chunks. If there is another analysis in which you are interested could you expand on the goals of this analysis?

darylgohl commented 1 year ago

Thanks Marcus! Here is some more information on the goals of our analysis (just let me know if further clarification is needed) We know that the modification in question has a specific target sequence (which is a 9 bp motif within the 50mer oligo) and for now our goal is just to see if we can detect either the presence of the deaza base or the larger modification within this motif. For this initial analysis, we want to compare:

  1. Forward, deaza-modified strand (Sample A) to reverse, unmodified strand (Sample A)
  2. Reverse, unmodified strand (Sample A), to reverse, novel modified strand (Sample B) and maybe...
  3. Forward, deaza-modified strand (Sample B) to reverse, novel modified strand (Sample B) Ultimately we'd like to establish a signature for this modification in the raw nanopore data that could potentially be used to assess more complex samples (understanding that we may need to generate additional data in other sequence contexts to accomplish this).

A few other technical questions based on your initial response: -From the response, it sounds like you can use minimap2 to generate the BAM files (I was assuming we would need to redo the basecalling and mapping with dorado). Is this correct that I can just use minimap2 with the existing fastq files? -It seems like splitting the reads into forward and reverse strands upstream of feeding them into Remora makes sense, as the cannonical/modified comparisons we want to make are strand specific (but please let me know if you see another easier way using Remora commands). -In this case, the process as I understand it, would be to:

  1. Generate a folder with all of the pass and fail pod5 files for a sample.
  2. Concatenate all of the pass and fail fastq files for a sample and use minimap2 to map to the oligo reference (I also assume I would want to trim adapters prior to mapping).
  3. Filter the BAM file into separate forward and reverse files.
  4. Use the full pod5 directory and the appropriate strand-specific BAM file as inputs into Remora.
marcus1487 commented 1 year ago

A couple of issues with this pipeline. First is that Remora currently requires a single POD5 file per sample. So you'll have to merge the pod files into one POD5 file. We are working on updating to allow for a directory of POD5s, but this is not currently implemented in the POD5 library. Second is that FASTQ basecalls do not have the move table. Minimap2 is a bit of a pain point here as minimap2 does not support uBAM input. Thus you'll have to follow the instructions here to map reads with move tags using minimap2.

Finally for the forward and reverse comparison, I'm not quite sure what you mean by this. Remora produces plots with reads from multiple groups mapped to the SAME reference sequence. I'm not sure how you would align reads mapped to the forward sequence of one contig with the reverse of another contig. If these sequences are the exact same then I would recommend making a reference to map all the strands. If the reference sequences are different, but similar over a small stretch then this is not supported natively in the Remora API. You can probably get started pretty well with the code in remora, but this would definitely be out of scope. Please let me know if there are specific bits of information that would be helpful if this is indeed your goal.

If you can map the two groups of reads to the same reference then you don't need to separate the forward and reverse into different BAM files as the reference region specified for plotting will include the strand from which to extract the reads and the associated bits of signal.

As a final point, I would start with a very small number of reads to get this analysis working before jumping to a very large run. signal plotting is really not useful over about 50 reads, so you may not even need to scale up too much.

darylgohl commented 1 year ago

Thanks! I have been unable to map these short reads using minimap2 (after trimming, the reads I am working with are only ~26 bp). However, I am able to map them with bowtie2 and I have generated bam files that I think include the move table by using the --sam-append-comment command in bowtie2. Here is what the sam/bam file looks like:

e11ab16a-d020-46ca-8046-5dbfdf80df95 16 50mer_reference 10 42 30M 0 0 CTCGCGAATATTATACGACTCACTATAGGG $%'()+35432010119AA@;:::869:<? AS:i:-2 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:1C28 YT:Z:UU BC:Z:runid=352b35ebabd48290feb1a53428bd978d57137874 read=16 ch=347 start_time=2023-07-19T12:13:28.772349-05:00 flow_cell_id=FAW88667 protocol_group_id=230719_FAW88667_Unmodified sample_id=230719_FAW88667_Unmodified barcode=barcode01 barcode_alias=barcode01 parent_read_id=e11ab16a-d020-46ca-8046-5dbfdf80df95 basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0 e7f2a250-0c0d-4bc2-8f3e-d2000da8ab72 16 50mer_reference 13 42 27M 0 0 GCGAATATTATACGACTCACTATAGGG ((',,,--/****4=====<:;<BD:5 AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:27 YT:Z:UU BC:Z:runid=352b35ebabd48290feb1a53428bd978d57137874 read=19 ch=330 start_time=2023-07-19T12:13:30.772349-05:00 flow_cell_id=FAW88667 protocol_group_id=230719_FAW88667_Unmodified sample_id=230719_FAW88667_Unmodified barcode=barcode01 barcode_alias=barcode01 parent_read_id=e7f2a250-0c0d-4bc2-8f3e-d2000da8ab72 basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0

However, I am now encountering another issue when I try to feed this data into remora. I am following the notebook here: https://github.com/nanoporetech/remora/blob/master/notebooks/metrics_api.ipynb. When I try to run io_read = io.Read.from_pod5_and_alignment(pod5_read, bam_read), I get the following error:

Traceback (most recent call last): File "", line 1, in File "/Users/darylgohl/opt/anaconda3/envs/py39/lib/python3.9/site-packages/remora/io.py", line 1570, in from_pod5_and_alignment read.add_alignment(alignment_record, reverse_signal=reverse_signal) File "/Users/darylgohl/opt/anaconda3/envs/py39/lib/python3.9/site-packages/remora/io.py", line 1540, in add_alignment self.ref_to_signal = DC.compute_ref_to_signal( File "/Users/darylgohl/opt/anaconda3/envs/py39/lib/python3.9/site-packages/remora/data_chunks.py", line 108, in compute_ref_to_signal assert query_to_signal.size == ref_to_read_knots[-1].astype(int) + 1, ( AttributeError: 'NoneType' object has no attribute 'size'

It looks like the bam_read and pod5_read variables both have content:

print(bam_read) e11ab16a-d020-46ca-8046-5dbfdf80df95 16 #0 10 42 30M CTCGCGAATATTATACGACTCACTATAGGG array('B', [3, 4, 6, 7, 8, 10, 18, 20, 19, 18, 17, 15, 16, 15, 16, 16, 24, 32, 32, 31, 26, 25, 25, 25, 23, 21, 24, 25, 27, 30])[('AS', -2), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '1C28'), ('YT', 'UU'), ('BC', 'runid=352b35ebabd48290feb1a53428bd978d57137874 read=16 ch=347 start_time=2023-07-19T12:13:28.772349-05:00 flow_cell_id=FAW88667 protocol_group_id=230719_FAW88667_Unmodified sample_id=230719_FAW88667_Unmodified barcode=barcode01 barcode_alias=barcode01 parent_read_id=e11ab16a-d020-46ca-8046-5dbfdf80df95 basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0')]

pod5_read.signal array([495, 496, 476, ..., 313, 298, 461], dtype=int16) len(pod5_read.signal) 3214

Do you have any advice on how to resolve this error?

marcus1487 commented 1 year ago

This error will be more helpful in the next release, but the reads need the MD tag to extract the reference sequence. You can use samtools calmd to add the tags.

darylgohl commented 1 year ago

Thanks, the reads have MD tags (it looks like they were already there in the original mapping, but I also ran calmd on the bam files). Here is what is read into remora for the first bam read: e11ab16a-d020-46ca-8046-5dbfdf80df95 16 #0 10 42 30M CTCGCGAATATTATACGACTCACTATAGGG array('B', [3, 4, 6, 7, 8, 10, 18, 20, 19, 18, 17, 15, 16, 15, 16, 16, 24, 32, 32, 31, 26, 25, 25, 25, 23, 21, 24, 25, 27, 30])[('AS', -2), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '1C28'), ('YT', 'UU'), ('BC', 'runid=352b35ebabd48290feb1a53428bd978d57137874 read=16 ch=347 start_time=2023-07-19T12:13:28.772349-05:00 flow_cell_id=FAW88667 protocol_group_id=230719_FAW88667_Unmodified sample_id=230719_FAW88667_Unmodified barcode=barcode01 barcode_alias=barcode01 parent_read_id=e11ab16a-d020-46ca-8046-5dbfdf80df95 basecall_model_version_id=dna_r10.4.1_e8.2_5khz_400bps_hac@v4.2.0')]

However, I am still getting the same error message when running io_read = io.Read.from_pod5_and_alignment(pod5_read, bam_read).

marcus1487 commented 1 year ago

Ah. It appears that the move table tag mv is not in this read. This needs to be output by the basecaller and transferred through the mapping software (with minimap2 this is the -y flag; not sure if bowtie2 has this option). It is likely possible to transfer this tag over, but this is certainly not a simple operation.

That will also lead to this error. As I noted we will be improving the error text in these settings in the next release, but hopefully this helps you.

darylgohl commented 1 year ago

Just a quick update here. I was able to get this data into Remora by redoing the basecalling with guppy to generate an unmapped BAM file with the move table data (my initial output from the GridION did not have this, just fastq and pod5 files). I then used bowtie2 with the --preserve-tags option to directly map the uBAM file to my reference.

marcus1487 commented 1 year ago

Great! If any issues do arise, do not hesitate to re-open this issue.