shortRNAhub / shortRNA

short RNA-seq analysis package
GNU General Public License v3.0
1 stars 2 forks source link

Converting object to TreeSummarisedExperiment? #17

Closed dktanwar closed 1 year ago

dktanwar commented 5 years ago

I think it would be possible to adapt to TreeSummarisedExperiment.

Things checked:

plger commented 5 years ago

Sub-objectives:

dktanwar commented 5 years ago

Hi @plger, I do not understand the second part much.

Could you please provide an example of that table?

plger commented 5 years ago

E.g. table (I'm making up the names here) :Read1 tRNA-Ala-XYZ-1Read1 tRNA-Ala-XYZ-2Read1 tRNA-Ala-ABC-1Read2 tRNA-Sel-GCG-1Read3 ...Etc.So you'd want to place Read1 under tRNA-Ala in the tree. Is that clearer?Pierre-LucOn Jul 9, 2019 9:57 AM, Deepak Tanwar notifications@github.com wrote:Hi @plger, I do not understand the second part much. Could you please provide an example of that table?

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

dktanwar commented 5 years ago

If we have a node with the following values/ leaves/fragments

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  °--tRNA-Tyr-GTA-3-2

The table should be:

Node Fragment
tRNA-Tyr-GTA-3 tRNA-Tyr-GTA-3-3
tRNA-Tyr-GTA-3 tRNA-Tyr-GTA-3-4

And, the output should be:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  ¦--tRNA-Tyr-GTA-3-2
  ¦--tRNA-Tyr-GTA-3-3
  °--tRNA-Tyr-GTA-3-4
plger commented 5 years ago

either I didn't get your example, or I haven't explained well enough (sorry that my table got messed up, was on the phone)...

So if your tree is:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  °--tRNA-Tyr-GTA-3-2

And you get a table that tells you what feature each read aligns to, e.g.:

Fragment matchingLeaf
Read1 tRNA-Tyr-GTA-3-1
Read1 tRNA-Tyr-GTA-3-2
Read2 tRNA-Tyr-GTA-3-1

Then the output should be:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
        °-- Read2
  ¦--tRNA-Tyr-GTA-3-2
  °--Read1

Any clearer?

dktanwar commented 5 years ago

Yes!

So, do you want to have the Reads as well in the tree along with tRNAs?

If yes, then does it make sense to put them as leaf of that tRNA?

plger commented 5 years ago

yes, that's the point.

dktanwar commented 5 years ago

ok. I will write the script and create an example

dktanwar commented 5 years ago

Done with 1 and 2. Putting the whole thing together...

dktanwar commented 5 years ago

Script: https://github.com/mansuylab/shortRNA/blob/deepak/R/tree.R

You can follow examples of each function to know how it is working. I have also pushed the test data: https://github.com/mansuylab/shortRNA/tree/deepak/test

dktanwar commented 5 years ago

For point 3

I think we would have to discuss these?

Script is too big: https://github.com/mansuylab/shortRNA/blob/ae8017fe50ec24e5545e393c5d7b6d0fb43da19f/R/assignment.R

Let me know if everything is OK so far and then how to proceed from here.

plger commented 5 years ago

So, first thanks a lot!

Regarding 1:

Regarding 2:

dktanwar commented 5 years ago

Hi @plger

You mentioned: we'd need to split "let-7a" into "let-7/let-7a"

This is what you mean by that?

 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7-7g                   
  ¦   ¦   ¦--let-7-7g-5p            
  ¦   ¦   ¦   ¦--let-7-7g-5p-A      
  ¦   ¦   ¦   ¦--let-7-7g-5p-C      
  ¦   ¦   ¦   ¦--let-7-7g-5p-G      
  ¦   ¦   ¦   °--let-7-7g-5p-T      
  ¦   ¦   °--let-7-7g-3p            
  ¦   ¦       ¦--let-7-7g-3p-A      
  ¦   ¦       ¦--let-7-7g-3p-C      
  ¦   ¦       ¦--let-7-7g-3p-G      
  ¦   ¦       °--let-7-7g-3p-T      
plger commented 5 years ago

Yes, though this would be more elegant (because consistent with the official nomenclature) :

 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7g                   
  ¦   ¦   ¦--let-7g-5p            
  ¦   ¦   ¦   ¦--let-7g-5p-A      
...
dktanwar commented 5 years ago

Yes, though this would be more elegant (because consistent with the official nomenclature) :

 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7g                   
  ¦   ¦   ¦--let-7g-5p            
  ¦   ¦   ¦   ¦--let-7g-5p-A      
...

Yes, this is what I was thinking...

dktanwar commented 5 years ago

Done!

miRNA                                   
   ¦--let                                 
   ¦   °--let-7                           
   ¦       ¦--let-7g                      
   ¦       ¦   ¦--let-7g-5p               
   ¦       ¦   ¦   ¦--let-7g-5p-A         
   ¦       ¦   ¦   ¦--let-7g-5p-C         
   ¦       ¦   ¦   ¦--let-7g-5p-G         
   ¦       ¦   ¦   °--let-7g-5p-T         
   ¦       ¦   °--let-7g-3p               
   ¦       ¦       ¦--let-7g-3p-A         
   ¦       ¦       ¦--let-7g-3p-C         
   ¦       ¦       ¦--let-7g-3p-G         
   ¦       ¦       °--let-7g-3p-T         
dktanwar commented 5 years ago

For: We'll also do the same (build trees) with gencode features, in the format: gene_biotype/gene/transcript_id

So, I think we have to do this from the gff file.

Its should be: gene_type/gene_name/transcript_name ? Or, should we use ENSEMBL IDs instead?

             seqid   source        type     start       end     score
1             chr1   HAVANA        gene   3073253   3074322        NA
2             chr1   HAVANA  transcript   3073253   3074322        NA
3             chr1   HAVANA        exon   3073253   3074322        NA
4             chr1  ENSEMBL        gene   3102016   3102125        NA
5             chr1  ENSEMBL  transcript   3102016   3102125        NA

             strand     phase                               ID
1                 +        NA             ENSMUSG00000102693.1
2                 +        NA             ENSMUST00000193812.1
3                 +        NA      exon:ENSMUST00000193812.1:1
4                 +        NA             ENSMUSG00000064842.1
5                 +        NA             ENSMUST00000082908.1

                     gene_id      gene_type     gene_name       level
1       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
2       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
3       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
4       ENSMUSG00000064842.1          snRNA       Gm26206           3
5       ENSMUSG00000064842.1          snRNA       Gm26206           3

             mgi_id          havana_gene               Parent
1       MGI:1918292 OTTMUSG00000049935.1                   NA
2       MGI:1918292 OTTMUSG00000049935.1 ENSMUSG00000102693.1
3       MGI:1918292 OTTMUSG00000049935.1 ENSMUST00000193812.1
4       MGI:5455983                   NA                   NA
5       MGI:5455983                   NA ENSMUSG00000064842.1

               transcript_id transcript_type   transcript_name
1                         NA              NA                NA
2       ENSMUST00000193812.1             TEC 4933401J01Rik-201
3       ENSMUST00000193812.1             TEC 4933401J01Rik-201
4                         NA              NA                NA
5       ENSMUST00000082908.1           snRNA       Gm26206-201

        transcript_support_level             tag    havana_transcript

1                             NA              NA                   NA
2                             NA           basic OTTMUST00000127109.1
3                             NA           basic OTTMUST00000127109.1
4                             NA              NA                   NA
5                             NA           basic                   NA

        exon_number              exon_id           protein_id      ccdsid
1                NA                   NA                   NA          NA
2                NA                   NA                   NA          NA
3                 1 ENSMUSE00001343744.1                   NA          NA
4                NA                   NA                   NA          NA
5                NA                   NA                   NA          NA

                    ont
1                    NA
2                    NA
3                    NA
4                    NA
5                    NA
plger commented 5 years ago

let's do it this way: create a stable gene id from gene_id (so that ENSMUSG00000102693.1 becomes ENSMUSG00000102693), and then we do: longRNA/gene_type/stable_gene_id:gene_name/transcript_id This way we have everything (transcript_name isn't very meaningful anyway)

plger commented 5 years ago

This looks good! Can you give me an idea of how long it takes to build the longRNA tree, and how long it takes to assign a large number of reads in the tree? Beside changing the package's other functions, I think we'll have to cleanup the tree a bit, for a few things:

dktanwar commented 5 years ago

Hi @plger

Well, for longRNAs, it does take a few hours I think. I left it running. We can provide those as the data itself from the package.

Because of the computation time, I was thinking of adapting the code to run in parallel. That would give different trees for different "biotype", and then we can add them short/long

For reads assignment. I haven't tested it. But, we can do it in parallel as well (separately for longRNA, miRNA, tRNA) and then merge them. Although, the question would be if a read could possibly assign to, for example, miRNA and tRNA both.

For "mature miRNAs", and "miRNA precursors", how do you want the coordinates to be added? An example, please.

Yes, I will add data from all the papers that I have.

dktanwar commented 5 years ago

All databases are downloaded here: /mnt/IM/projects/software/shortRNA/test/Mus_musculus

I will make the trees!

dktanwar commented 5 years ago

Hi @plger , so, it might mot make sense to use tree for piRNAs, rRNAs (7 in total), circRNa as these are just numbered (except rRNA).

If we are going to have just 1 tree from mm10, these could be easily converted to a tree.

Can we discuss on Monday?

dktanwar commented 5 years ago

Hi @plger

Made it easier for you to review code:

https://reviewable.io/reviews/mansuylab/shortrna/20

plger commented 4 years ago

Hi, I've just pushed the long-awaited working version of the assignment (in branch plg), and I'm happy to say that it has been considerably sped up. You should now be able to start working on the rest. So, right now the procedure is as follows:


1. Extract reads and build count matrix (here we assume that the reads have already been trimmed)

suppressPackageStartupMessages({
  library(Biostrings)
})
devtools::load_all("shortRNA/")
m <- fastq2SeqCountMatrix(list.files(pattern="fastq"))
fa <- DNAStringSet(row.names(m))
names(fa) <- paste0("S",1:length(fa))
writeXStringSet(fa, "unique.fasta")

2. Build the annotation This obviously we do only once, and we build the custom genome from this:

filelist <- getAnnotationFiles("hg38", destination="annotation")
a <- prepareAnnotation(filelist = filelist, destination = "annotation",includeIsomirs = FALSE)
saveRDS(a, file="annotation/anno.rds")

3. Align the reads However you like, so long there's a bam file at the end...

shortRNAexp_align("unique.fasta", "aligned.bam", m=200, nthreads=8,
                  bowtie1index="/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/shortRNA/bowtie1/custom", 
                  starindex = "/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/genes_STARIndex",
                  bowtie1 = "/conda/bin/bowtie", star = "/conda/bin/STAR", samtools = "/conda/bin/samtools")

4. get the overlaps between alignments and annotation

a <- readRDS("annotation/anno.rds")
o <- overlapWithTx2("aligned.bam", a, nbthreads = 8)
# we check which feature types are valid with the overlaping reads
o <- getOverlapValidity(o)

5. We build the tree paths only for the features that have some overlaps At this stage we work with the pathStrings, since this is much faster to manipulate than the trees. Moreover, doing this only for the detected features will speed up things:

em <- a@elementMetadata
uids <- unique(o$transcript_id)
em <- em[which(em$transcript_id %in% uids),]
# eventually replace this with your function
pathString <- paste(em$transcript_type, em$gene_id, em$transcript_id, sep="/")
names(pathString) <- em$transcript_id

6. We assign the reads

ar <- assignReads(o, tree=pathString)

7. We extract ambiguously assigned features, and add them to the paths

w <- grep(ar$gene_id, "/ambiguous$")
pathString <- c(pathString, unique(paste(ar$gene_id[w],ar$transcript_id[w],sep="/")))

8. we build the tree and treeSummarizedExperiment

TO DO: Here we need to transform the pathString into a phylo object, and then at the individual sequences to it. Once we've got this, we can build the TSE, e.g.:

tse <- TreeSummarizedExperiment(assays=list(counts=m[row.names(ar),]),
                                rowData=ar, rowTree=tree)

Eventually we'll plug all of this into wrapper functions, but at the moment I think it's easier to keep them separate.

plger commented 4 years ago

@dktanwar If there's anything unclear (which I suppose there will be!), we can meet anytime this week

dktanwar commented 1 year ago

This is also completed!