Closed chilampoon closed 1 month ago
Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.
Remora is not currently designed to work with spliced RNA alignments. This is not currently on the roadmap, but is certainly possible. The fix would be to this function primarily. I'd be happy to accept a PR for this, but need to ensure that it would not have any impact on standard genomic alignments.
Okay, I misunderstood something before but updated the understanding later on, I will open a PR soon.
I wrote a function to get the query to reference mapping, and modified something to get the query to ref coordinates mappings -> this leads to a few differences from the original ref_to_signal values (for genomic reads) but they are more or less similar... for example if I just plt.plot(ref_to_signal)
. This read's cigar string is 3S10M1D24M1I5M1I7M1D34M1D16M2I15M1I7M1D6M1I12M1I1M2I11M2I5M3I20M15S
it works for my spliced read mentioned above:
please review my PR, thank you very much!
I'm on paternity leave at the moment, so I won't be able to take a very close look at this for a little bit, but at a high level unless there is a clear reason I think any differences in produced genomic alignments will not be acceptable. If differences in genomic alignments are required to support the skip operation could you detail those here?
Got it, thanks @marcus1487. The differences come from map_ref_to_signal()
def map_ref_to_signal(*, query_to_signal, ref_to_query_knots):
return np.floor(
np.interp(
ref_to_query_knots,
np.arange(query_to_signal.size),
query_to_signal,
)
).astype(int)
if stick to it, the ref_to_signal
array with knots from my new function get_ref_coords()
is exactly the same as before for genomic reads.
However for spliced reads, it doesn't generate the correct mapping as the xp
in numpy.interp()
is not continuous but with gaps - that's why I added
query_to_ref_coords = np.interp(
np.arange(query_to_signal.size),
np.arange(ref_to_query_knots.size),
ref_to_query_knots
).astype(int)
to replace np.arange(query_to_signal.size)
.
One way I could think of is to first determine if a read has intron or not, then based on that choose to use np.arange(query_to_signal.size)
or the other mapping..
@marcus1487 I found out I don't need another interpolation in map_ref_to_signal()
at all 🤦♀️
the spliced read's ref_to_signal is now like
where the last 73 bases are soft clippings.
I then tested the genomic/unspliced reads I have (I wrapped the steps into get_manual_mappins()
- please ignore the type)
matches = []
num_processed_reads = 0
for bam_read in bam_reads_reg:
try:
mapping = get_manual_mappins(bam_read, pod5_reads[bam_read.query_name])
num_processed_reads += 1
except:
continue
io_read = Read.from_pod5_and_alignment(
pod5_read_record=pod5_reads[bam_read.query_name],
alignment_record=bam_read,
reverse_signal=True,
)
matches.append((mapping == io_read.ref_to_signal).all())
print(num_processed_reads)
np.unique(matches, return_counts=True)
66686
(array([ True]), array([66686]))
the results are all the same! sorry for the mess... all the stuff is a bit complicated. I'm still in awe that people like you can create such complex software, I am going to update the PR now
It seems that the core issue here has been resolved. Please re-open if you experience further issues.
Hi @marcus1487, I was using remora API on real direct RNA-seq data, but ran into errors from this command:
Apparently
ref_seq:2542
is the number of bases of this gene's exonic regions (i.e. seq length of this read), andmove+cigar:16978
~= ref_end - ref_start on the genomic region (i.e. the range of all exons and introns).I then tried to resolve it, for here if changes to
will get the correct lengths as it uses reference knots and query knots to generate reference to read knots. Then for map_ref_to_signal() I guess can just
then it gives
ref_to_signal
coordinate mapping. I haven't tested it further, but I can provide a toy dataset for this issue, thank you and looking forward to your comments.