Open MustafaElshani opened 4 years ago
Hi,
yeah, that is the plan. Any other suggestions?
Hi,
yeah, that is the plan. Any other suggestions?
Hi @smaegol
Thanks for your reply it would be really good to support tailfindr csv as it includes support for cDNA as well as RNA. As for the other suggestions I have plenty however here are some
1) Is it possible to determine Alternative Polyadyalation sites usage
2) Squiggle visualization comparing genes from different groups to visualize length and possible site changes
3) for differential analysis would it be possible to include inputs from salmon counts, DRIMseq and stageR pipelines. There is alot that can be done here
Kind Regards
Mustafa
Import lengths and/or differential lengths into isoformSwitchAnalyzeR would be cool as well. At the moment there are multiple features of an isoform that can be considered when considering the significance of isoform switches.
Hi @smaegol
2+ years on! I now have more experience with tailfindr so I'd like to go through the outputs and hopefully you can figure out how to incorporate results to nanotail if you are still maintaining it.
So tailfindr raw output file from tailfindr has the following dataframe column described here.
Filter tailfindr raw output I'm basing the following on 'cDNA' dataset. Initially the raw data is filtered as follows
INVALID
or BLANK from read_type
coloumn.FALSE
from the tail_is_valid
column.NA
from any columns. Annotate readids from sorted_bam
Continuing on !
tailfindr has an annotate function but I found it in running it within R
to be slow on big datasets so I utilised samtools view
,extracting the readid, reference, mapq and flag
I could only do the following on fastq
referenced to transcriptome, GENCODEin particular, I tried genome I have no idea how to get transcript_id etc from there
samtools view -@ 40 alignments/"${sample}"_txome_sorted.bam | awk ' BEGIN {print "read_id,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG"} ($2 != 256) && ($2 != 272) && ($2 != 4) && ($2 < 2048) {split($1,q,"|"); split($3,r,"|"); print q[2] "," r[1] "," r[2] "," r[6] "," r[7] "," r[8] "," $5 "," $2}' > tailfindr/annotated_readid/"${sample}"_readid_annotated.csv
The above was done on multiple samples and as an output I got the following columns.
read_id,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG
Join annotated readid with tailfindr output
So now I joined the filtered tailfindr output with "${sample}"_readid_annotated.csv
for each sample which gave me a complete annotated tail_length for each readid annotated.
I think this is the point where Nanotail could take over
The output has the following coloumns
read_id,read_type,tail_is_valid,tail_start,tail_end,samples_per_nt,tail_length,file_name,transcript_id,gene_id,gene_name,length,transcript_type,mapping_quality,FLAG,sample_name,condition
So as it currently stands can nanotail take any of the above and do ananlysis?
Plotting the tail
The output also contains file_path
to the original fast5
, I think from here you can now query the readid on that fast5_file
and draw the squiggles as shown in tailfindr
What would be really good if from shiny transcript_id
, gene_id
or gene_names
could be filtered and list all readids.
In a addition if only selected searched readid could be plotted using tailfindr function. Having ability to draw violin plots for each transcript or gene within shiny would be ace. I think if we get to here there are potentially other things that could be done
Mustafa
Hi, thanks for the comments. I have most of the features you requested working internally. Hopefully they will be released soon in this repository.
Also, what do you find the most useful from the package? The shiny app? Companion functions for reading data and plotting? Statistical calculations? With the feedback, I would then know what to focus on further.
best,
Pawel
Hi Pawel
That's great I appreciate taking the comments on board. I'm a big fan of interacting with data and having shiny offers that. I think being able to plot is great, I believe the violin plot is most fitting here. Statistics are crucial but I think you have covered that well. There are a couple of other things that I'd like to add.
Statistics and plotting at transcript levels
I believe the transcripts level is more relevant as we may have different splice variants or alternative polyadenylation sites having different tail lengths. Gene level could be an option.
Plotting polyA tails
I was trying to figure out what would be the best approach to plot the tails using tailfindr function, instead of plotting for every readid one can select by transcript_id which then pass on all the read_id
of that transcript to ont_fast5_api
'fast5_subset' to create .fast5 for that transcript and then plot the tails which shouldn't take long or use resource. I haven't figured if it is possible to overlay multiple read signals and color-code them for each group condition.
Looking forward to see what you have done already Mustafa
Hi Mustafa,
I am also a fan of interacting with data. And it's also an easier way to present data for your PI :-)
For plotting poly(A) tail lengths for less abundant transcripts, I prefer the quasirandom plot from the ggbeeswarm package, as the violin plot can be misleading in such cases. This will be integrated into the package.
As for statistics, poly(A) profiling at the isoform level is more complicated in my opinion. Even if you map the transcriptome (as I usually do), you cannot be sure that the isoform has been correctly mapped because the direct RNA reads often do not reach the 5' end of the transcript. However, alternative polyadenylation sites are much easier to process.
As for plotting the raw signals from the tails, it's not that complicated, I use rhdf5 for that purpose. I don't think there is any advantage to overlaying multiple signals, and it's more complicated because you have to somehow resquiggle, or rescale the signal. You can get a glimpse of our approach to representing raw currents in Figure 1d in our recent preprint: https://www.biorxiv.org/content/10.1101/2022.12.01.518149v2.full Similar is going to be present either in this package or in the ninetails package (https://github.com/LRB-IIMCB/ninetails). Or both approaches, we will see.
best, Pawel
Hi Will nanotail be able to support tailfindr output files ?
Kind Regards