smaegol / nanotail

R package for visualization and exploratory analysis of Oxford Nanopore direct RNA seq based polyA predictions
GNU General Public License v3.0
9 stars 5 forks source link

Tailfindr #1

Open MustafaElshani opened 4 years ago

MustafaElshani commented 4 years ago

Hi Will nanotail be able to support tailfindr output files ?

Kind Regards

smaegol commented 4 years ago

Hi,

yeah, that is the plan. Any other suggestions?

MustafaElshani commented 4 years ago

Hi,

yeah, that is the plan. Any other suggestions?

Hi @smaegol

Thanks for your reply it would be really good to support tailfindr csv as it includes support for cDNA as well as RNA. As for the other suggestions I have plenty however here are some

1) Is it possible to determine Alternative Polyadyalation sites usage
2) Squiggle visualization comparing genes from different groups to visualize length and possible site changes 3) for differential analysis would it be possible to include inputs from salmon counts, DRIMseq and stageR pipelines. There is alot that can be done here

Kind Regards

Mustafa

callumparr commented 1 year ago

Import lengths and/or differential lengths into isoformSwitchAnalyzeR would be cool as well. At the moment there are multiple features of an isoform that can be considered when considering the significance of isoform switches.

MustafaElshani commented 1 year ago

Hi @smaegol

2+ years on! I now have more experience with tailfindr so I'd like to go through the outputs and hopefully you can figure out how to incorporate results to nanotail if you are still maintaining it.

So tailfindr raw output file from tailfindr has the following dataframe column described here.

Filter tailfindr raw output I'm basing the following on 'cDNA' dataset. Initially the raw data is filtered as follows

Annotate readids from sorted_bam Continuing on ! tailfindr has an annotate function but I found it in running it within R to be slow on big datasets so I utilised samtools view ,extracting the readid, reference, mapq and flag

I could only do the following on fastq referenced to transcriptome, GENCODEin particular, I tried genome I have no idea how to get transcript_id etc from there

samtools view -@ 40 alignments/"${sample}"_txome_sorted.bam | awk ' BEGIN {print "read_id,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG"} ($2 != 256) && ($2 != 272) && ($2 != 4) && ($2 < 2048) {split($1,q,"|"); split($3,r,"|"); print q[2] "," r[1] "," r[2] "," r[6] "," r[7] "," r[8] "," $5 "," $2}' > tailfindr/annotated_readid/"${sample}"_readid_annotated.csv

The above was done on multiple samples and as an output I got the following columns.

read_id,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG

Join annotated readid with tailfindr output

So now I joined the filtered tailfindr output with "${sample}"_readid_annotated.csv for each sample which gave me a complete annotated tail_length for each readid annotated.

I think this is the point where Nanotail could take over

The output has the following coloumns read_id,read_type,tail_is_valid,tail_start,tail_end,samples_per_nt,tail_length,file_name,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG,sample_name,condition

So as it currently stands can nanotail take any of the above and do ananlysis?

Plotting the tail

The output also contains file_path to the original fast5, I think from here you can now query the readid on that fast5_file and draw the squiggles as shown in tailfindr

What would be really good if from shiny transcript_id, gene_id or gene_names could be filtered and list all readids.

In a addition if only selected searched readid could be plotted using tailfindr function. Having ability to draw violin plots for each transcript or gene within shiny would be ace. I think if we get to here there are potentially other things that could be done

Mustafa

smaegol commented 1 year ago

Hi, thanks for the comments. I have most of the features you requested working internally. Hopefully they will be released soon in this repository.

Also, what do you find the most useful from the package? The shiny app? Companion functions for reading data and plotting? Statistical calculations? With the feedback, I would then know what to focus on further.

best,

Pawel

MustafaElshani commented 1 year ago

Hi Pawel

That's great I appreciate taking the comments on board. I'm a big fan of interacting with data and having shiny offers that. I think being able to plot is great, I believe the violin plot is most fitting here. Statistics are crucial but I think you have covered that well. There are a couple of other things that I'd like to add.

Statistics and plotting at transcript levels

I believe the transcripts level is more relevant as we may have different splice variants or alternative polyadenylation sites having different tail lengths. Gene level could be an option.

Plotting polyA tails

I was trying to figure out what would be the best approach to plot the tails using tailfindr function, instead of plotting for every readid one can select by transcript_id which then pass on all the read_id of that transcript to ont_fast5_api 'fast5_subset' to create .fast5 for that transcript and then plot the tails which shouldn't take long or use resource. I haven't figured if it is possible to overlay multiple read signals and color-code them for each group condition.

Looking forward to see what you have done already Mustafa

smaegol commented 1 year ago

Hi Mustafa,

I am also a fan of interacting with data. And it's also an easier way to present data for your PI :-)

For plotting poly(A) tail lengths for less abundant transcripts, I prefer the quasirandom plot from the ggbeeswarm package, as the violin plot can be misleading in such cases. This will be integrated into the package.

As for statistics, poly(A) profiling at the isoform level is more complicated in my opinion. Even if you map the transcriptome (as I usually do), you cannot be sure that the isoform has been correctly mapped because the direct RNA reads often do not reach the 5' end of the transcript. However, alternative polyadenylation sites are much easier to process.

As for plotting the raw signals from the tails, it's not that complicated, I use rhdf5 for that purpose. I don't think there is any advantage to overlaying multiple signals, and it's more complicated because you have to somehow resquiggle, or rescale the signal. You can get a glimpse of our approach to representing raw currents in Figure 1d in our recent preprint: https://www.biorxiv.org/content/10.1101/2022.12.01.518149v2.full Similar is going to be present either in this package or in the ninetails package (https://github.com/LRB-IIMCB/ninetails). Or both approaches, we will see.

best, Pawel