smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0
28 stars 8 forks source link

Allow input of already p-shifted reads (bigwig) #102

Closed Roleren closed 3 weeks ago

Roleren commented 2 years ago

Is your feature request related to a problem? Please describe. This tool can not input already shifted data, makes it less useful for the community, since sometimes large bigwig files can be used, which are not possible to fit in a bam file.

Describe the solution you'd like optional input for bigwig (ORFquant supports this with very little tinkering already)

Describe alternatives you've considered One alternative is either to use another tool, or I can also just hack a solution by creating a fork that skips to the step where you have the alignments as p sites

Additional context For very large Ribo-seq files (What would have been a > 1 TB bam file etc) With bigwig, it can be loaded basically instantly.

saketkc commented 2 years ago

Thanks for the suggestions. While I do not have immediate plans to support this functionality, a PR is always welcome.

Roleren commented 2 years ago

I have already forked the repo, I think I see how it can be done.

Your final processing before ORF prediction is this: merge_read_lengths() It claims to return:

Returns
    -------
    merged_alignments: dict(dict)
                       alignments by merging all lengths

I see the dict is structured like this:

count = alignments[length][strand][(chrom, pos)] So it is a dict, by length -> strand -> then a double key on chrome and pos. In the final return, you remove the length key and merge the results of all lengths. Resulting in this dict structure: merged_alignments[strand][(chrom, pos_shifted)]

So if I understand correctly, I would need to convert the bigwig (which is split on strand) For each value in bigwig per chromosome, input the shifted read position, what you call (pos_shifted)

I then could call the main detection function: export_orf_coverages() using this dict I made from the bigwig. This is possible, and of course then the metagene plots are skiped, since it is presumed the user already did this correctly.

I think it sounds good, what do you think? Would you be interested to put it as a separate function in the cli if I make a PR, so that people could easily call if from directly from bigwig?

Would make your already quite fast tool, finish in no time.

saketkc commented 2 years ago

Your understanding is correct. I will just elaborate it below.

  1. Bigwig input mode will require two files: 1) positive strand bigwig 2) negative strand bigwig
  2. The offset can then be set to 0 (because the input bigwigs have already been shifted), this means the method could in principle be called by merge_read_lengths(alignments, psite_offsets = dict([read_length, 0] for read_length in alignments.keys()))

In principle, all that needs to be modified is a helper function that enables bigwig -> alignments dict conversion. Does this make sense?

andrewdavidsmith commented 3 weeks ago

Cosing due to inactivity.