Converting object to TreeSummarisedExperiment?

dktanwar commented 5 years ago

I think it would be possible to adapt to TreeSummarisedExperiment.

Things checked:

It is possible to subset the TSE object.
It is possible to append the TSE object.

plger commented 5 years ago

Sub-objectives:

[x] 1) function to construct a features tree.
- Input: one or more gtf file of features, eventually other annotations if needed (e.g. use the output of prepareAnnotation)
- Output: a dendrogram or phylo object Start with miRNA and tRNA:
- [x] for tRNA the parent structures can be obtained just by trimming the -* on the right of the name, e.g. tRNA-Val-CAC-1-1 belongs to tRNA-Val-CAC-1, which is a child of tRNA-Val-CAC, which is a child of tRNA-Val and finally tRNA
- [x] for miRNA, use the mirbase names and do the same thing, then grouping by miRNAs cluster (we'll eventually have an alternative seed-based tree later)
[x] 2) function to add fragments to an existing tree
- Input: a feature tree obtained from 1), as well as a table indicating to which feature each fragment maps.
- Output: a feature tree extended with the fragments as leaves (ideally phylo, since this is what TSE uses)
[ ] 3) once we have 1) and 2), we'll be able to modify the assignRead function with the following input/output:
- Input: the feature tree from 1), the overlap table (already done, though slow), and assignment rules
- Output: a TSE

dktanwar commented 5 years ago

Hi @plger, I do not understand the second part much.

Could you please provide an example of that table?

plger commented 5 years ago

E.g. table (I'm making up the names here) :Read1 tRNA-Ala-XYZ-1Read1 tRNA-Ala-XYZ-2Read1 tRNA-Ala-ABC-1Read2 tRNA-Sel-GCG-1Read3 ...Etc.So you'd want to place Read1 under tRNA-Ala in the tree. Is that clearer?Pierre-LucOn Jul 9, 2019 9:57 AM, Deepak Tanwar notifications@github.com wrote:Hi @plger, I do not understand the second part much. Could you please provide an example of that table?

—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.

dktanwar commented 5 years ago

If we have a node with the following values/ leaves/fragments

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  °--tRNA-Tyr-GTA-3-2

The table should be:

Node	Fragment
tRNA-Tyr-GTA-3	tRNA-Tyr-GTA-3-3
tRNA-Tyr-GTA-3	tRNA-Tyr-GTA-3-4

And, the output should be:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  ¦--tRNA-Tyr-GTA-3-2
  ¦--tRNA-Tyr-GTA-3-3
  °--tRNA-Tyr-GTA-3-4

plger commented 5 years ago

either I didn't get your example, or I haven't explained well enough (sorry that my table got messed up, was on the phone)...

So if your tree is:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
  °--tRNA-Tyr-GTA-3-2

And you get a table that tells you what feature each read aligns to, e.g.:

Fragment	matchingLeaf
Read1	tRNA-Tyr-GTA-3-1
Read1	tRNA-Tyr-GTA-3-2
Read2	tRNA-Tyr-GTA-3-1

Then the output should be:

tRNA-Tyr-GTA-3      
  ¦--tRNA-Tyr-GTA-3-1
        °-- Read2
  ¦--tRNA-Tyr-GTA-3-2
  °--Read1

Any clearer?

dktanwar commented 5 years ago

Yes!

So, do you want to have the Reads as well in the tree along with tRNAs?

If yes, then does it make sense to put them as leaf of that tRNA?

plger commented 5 years ago

yes, that's the point.

dktanwar commented 5 years ago

ok. I will write the script and create an example

dktanwar commented 5 years ago

Done with 1 and 2. Putting the whole thing together...

dktanwar commented 5 years ago

Script: https://github.com/mansuylab/shortRNA/blob/deepak/R/tree.R

You can follow examples of each function to know how it is working. I have also pushed the test data: https://github.com/mansuylab/shortRNA/tree/deepak/test

dktanwar commented 5 years ago

For point 3

I think we would have to discuss these?

Script is too big: https://github.com/mansuylab/shortRNA/blob/ae8017fe50ec24e5545e393c5d7b6d0fb43da19f/R/assignment.R

Let me know if everything is OK so far and then how to proceed from here.

plger commented 5 years ago

So, first thanks a lot!

Regarding 1:

we don't really need to keep the sequence of the features (data.tree isn't particularly fast, so I'd say let's keep it minimal!), and one could simplify (I'll email you a version).
I also forgot to mention, but we'd need to split "let-7a" into "let-7/let-7a"...
We'll also do the same (build trees) with gencode features, in the format: gene_biotype/gene/transcript_id

Regarding 2:

when you add a read, you can just add the sequence (as the name of the node)
the function throws an error when all the features are found

when a read maps to two things, your function assigns it to both. What we want is that it assigns it to the lowest common node of the two. For example:

testDF <- data.frame( 
Reads=c("R1","R1","R2","R3"), 
Features=c("tRNA-Ala-AGC-11-1", "tRNA-Ala-AGC-12-1", "tRNA-Ala-AGC-11-1", "non-existing-feature")
)
addReadsFeatures(tree, testDF)

gives this:

2    ¦--tRNA-Ala-AGC             
3    ¦   ¦--tRNA-Ala-AGC-1       
4    ¦   ¦   °--tRNA-Ala-AGC-1-1 
5    ¦   ¦--tRNA-Ala-AGC-10      
6    ¦   ¦   °--tRNA-Ala-AGC-10-1
7    ¦   ¦--tRNA-Ala-AGC-11      
8    ¦   ¦   °--tRNA-Ala-AGC-11-1
9    ¦   ¦       ¦--R1           
10   ¦   ¦       °--R2           
11   ¦   ¦--tRNA-Ala-AGC-12      
12   ¦   ¦   °--tRNA-Ala-AGC-12-1
13   ¦   ¦       °--R1

when what we'd want is this:

¦--tRNA-Ala-AGC 
¦   ¦--R1     
¦   ¦--tRNA-Ala-AGC-1       
¦   ¦   °--tRNA-Ala-AGC-1-1 
¦   ¦--tRNA-Ala-AGC-10      
¦   ¦   °--tRNA-Ala-AGC-10-1
¦   ¦--tRNA-Ala-AGC-11      
¦   ¦   °--tRNA-Ala-AGC-11-1
¦   ¦       °--R2           
¦   ¦--tRNA-Ala-AGC-12      
¦   ¦   °--tRNA-Ala-AGC-12-1

dktanwar commented 5 years ago

Hi @plger

You mentioned: we'd need to split "let-7a" into "let-7/let-7a"

This is what you mean by that?

 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7-7g                   
  ¦   ¦   ¦--let-7-7g-5p            
  ¦   ¦   ¦   ¦--let-7-7g-5p-A      
  ¦   ¦   ¦   ¦--let-7-7g-5p-C      
  ¦   ¦   ¦   ¦--let-7-7g-5p-G      
  ¦   ¦   ¦   °--let-7-7g-5p-T      
  ¦   ¦   °--let-7-7g-3p            
  ¦   ¦       ¦--let-7-7g-3p-A      
  ¦   ¦       ¦--let-7-7g-3p-C      
  ¦   ¦       ¦--let-7-7g-3p-G      
  ¦   ¦       °--let-7-7g-3p-T

plger commented 5 years ago

Yes, though this would be more elegant (because consistent with the official nomenclature) :

 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7g                   
  ¦   ¦   ¦--let-7g-5p            
  ¦   ¦   ¦   ¦--let-7g-5p-A      
...

dktanwar commented 5 years ago

Yes, though this would be more elegant (because consistent with the official nomenclature) :
 miRNA                              
  ¦--let-7                          
  ¦   ¦--let-7g                   
  ¦   ¦   ¦--let-7g-5p            
  ¦   ¦   ¦   ¦--let-7g-5p-A      
...

Yes, this is what I was thinking...

dktanwar commented 5 years ago

Done!

miRNA                                   
   ¦--let                                 
   ¦   °--let-7                           
   ¦       ¦--let-7g                      
   ¦       ¦   ¦--let-7g-5p               
   ¦       ¦   ¦   ¦--let-7g-5p-A         
   ¦       ¦   ¦   ¦--let-7g-5p-C         
   ¦       ¦   ¦   ¦--let-7g-5p-G         
   ¦       ¦   ¦   °--let-7g-5p-T         
   ¦       ¦   °--let-7g-3p               
   ¦       ¦       ¦--let-7g-3p-A         
   ¦       ¦       ¦--let-7g-3p-C         
   ¦       ¦       ¦--let-7g-3p-G         
   ¦       ¦       °--let-7g-3p-T

dktanwar commented 5 years ago

For: We'll also do the same (build trees) with gencode features, in the format: gene_biotype/gene/transcript_id

So, I think we have to do this from the gff file.

Its should be: gene_type/gene_name/transcript_name ? Or, should we use ENSEMBL IDs instead?

             seqid   source        type     start       end     score
1             chr1   HAVANA        gene   3073253   3074322        NA
2             chr1   HAVANA  transcript   3073253   3074322        NA
3             chr1   HAVANA        exon   3073253   3074322        NA
4             chr1  ENSEMBL        gene   3102016   3102125        NA
5             chr1  ENSEMBL  transcript   3102016   3102125        NA

             strand     phase                               ID
1                 +        NA             ENSMUSG00000102693.1
2                 +        NA             ENSMUST00000193812.1
3                 +        NA      exon:ENSMUST00000193812.1:1
4                 +        NA             ENSMUSG00000064842.1
5                 +        NA             ENSMUST00000082908.1

                     gene_id      gene_type     gene_name       level
1       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
2       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
3       ENSMUSG00000102693.1            TEC 4933401J01Rik           2
4       ENSMUSG00000064842.1          snRNA       Gm26206           3
5       ENSMUSG00000064842.1          snRNA       Gm26206           3

             mgi_id          havana_gene               Parent
1       MGI:1918292 OTTMUSG00000049935.1                   NA
2       MGI:1918292 OTTMUSG00000049935.1 ENSMUSG00000102693.1
3       MGI:1918292 OTTMUSG00000049935.1 ENSMUST00000193812.1
4       MGI:5455983                   NA                   NA
5       MGI:5455983                   NA ENSMUSG00000064842.1

               transcript_id transcript_type   transcript_name
1                         NA              NA                NA
2       ENSMUST00000193812.1             TEC 4933401J01Rik-201
3       ENSMUST00000193812.1             TEC 4933401J01Rik-201
4                         NA              NA                NA
5       ENSMUST00000082908.1           snRNA       Gm26206-201

        transcript_support_level             tag    havana_transcript

1                             NA              NA                   NA
2                             NA           basic OTTMUST00000127109.1
3                             NA           basic OTTMUST00000127109.1
4                             NA              NA                   NA
5                             NA           basic                   NA

        exon_number              exon_id           protein_id      ccdsid
1                NA                   NA                   NA          NA
2                NA                   NA                   NA          NA
3                 1 ENSMUSE00001343744.1                   NA          NA
4                NA                   NA                   NA          NA
5                NA                   NA                   NA          NA

                    ont
1                    NA
2                    NA
3                    NA
4                    NA
5                    NA

plger commented 5 years ago

let's do it this way: create a stable gene id from gene_id (so that ENSMUSG00000102693.1 becomes ENSMUSG00000102693), and then we do: longRNA/gene_type/stable_gene_id:gene_name/transcript_id This way we have everything (transcript_name isn't very meaningful anyway)

plger commented 5 years ago

This looks good! Can you give me an idea of how long it takes to build the longRNA tree, and how long it takes to assign a large number of reads in the tree? Beside changing the package's other functions, I think we'll have to cleanup the tree a bit, for a few things:

[ ] the "long RNA" branch of the tree contains also short RNAs (e.g. snoRNAs), because the gtf contains those, so I guess these should be moved to the shortRNA branch.
[ ] either in the gencode gtf, or from mirbase, we can get the coordinates of the miRNA precursors and miRNA hairpins - these should be added in the tree, but maybe as a separate branch (e.g. miRNAs splits into "mature miRNAs", and "miRNA precursors").
[ ] Let's make sure we include all the shortRNAs included by sports and Gerstein's software, although at the moment exclude piRNAs.

dktanwar commented 5 years ago

Hi @plger

Well, for longRNAs, it does take a few hours I think. I left it running. We can provide those as the data itself from the package.

Because of the computation time, I was thinking of adapting the code to run in parallel. That would give different trees for different "biotype", and then we can add them short/long

For reads assignment. I haven't tested it. But, we can do it in parallel as well (separately for longRNA, miRNA, tRNA) and then merge them. Although, the question would be if a read could possibly assign to, for example, miRNA and tRNA both.

For "mature miRNAs", and "miRNA precursors", how do you want the coordinates to be added? An example, please.

Yes, I will add data from all the papers that I have.

dktanwar commented 5 years ago

All databases are downloaded here: /mnt/IM/projects/software/shortRNA/test/Mus_musculus

I will make the trees!

dktanwar commented 5 years ago

Hi @plger , so, it might mot make sense to use tree for piRNAs, rRNAs (7 in total), circRNa as these are just numbered (except rRNA).

If we are going to have just 1 tree from mm10, these could be easily converted to a tree.

Can we discuss on Monday?

dktanwar commented 5 years ago

Hi @plger

Made it easier for you to review code:

https://reviewable.io/reviews/mansuylab/shortrna/20

plger commented 4 years ago

Hi, I've just pushed the long-awaited working version of the assignment (in branch plg), and I'm happy to say that it has been considerably sped up. You should now be able to start working on the rest. So, right now the procedure is as follows:

1. Extract reads and build count matrix (here we assume that the reads have already been trimmed)

suppressPackageStartupMessages({
  library(Biostrings)
})
devtools::load_all("shortRNA/")
m <- fastq2SeqCountMatrix(list.files(pattern="fastq"))
fa <- DNAStringSet(row.names(m))
names(fa) <- paste0("S",1:length(fa))
writeXStringSet(fa, "unique.fasta")

2. Build the annotation This obviously we do only once, and we build the custom genome from this:

filelist <- getAnnotationFiles("hg38", destination="annotation")
a <- prepareAnnotation(filelist = filelist, destination = "annotation",includeIsomirs = FALSE)
saveRDS(a, file="annotation/anno.rds")

3. Align the reads However you like, so long there's a bam file at the end...

shortRNAexp_align("unique.fasta", "aligned.bam", m=200, nthreads=8,
                  bowtie1index="/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/shortRNA/bowtie1/custom", 
                  starindex = "/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/genes_STARIndex",
                  bowtie1 = "/conda/bin/bowtie", star = "/conda/bin/STAR", samtools = "/conda/bin/samtools")

4. get the overlaps between alignments and annotation

a <- readRDS("annotation/anno.rds")
o <- overlapWithTx2("aligned.bam", a, nbthreads = 8)
# we check which feature types are valid with the overlaping reads
o <- getOverlapValidity(o)

5. We build the tree paths only for the features that have some overlaps At this stage we work with the pathStrings, since this is much faster to manipulate than the trees. Moreover, doing this only for the detected features will speed up things:

em <- a@elementMetadata
uids <- unique(o$transcript_id)
em <- em[which(em$transcript_id %in% uids),]
# eventually replace this with your function
pathString <- paste(em$transcript_type, em$gene_id, em$transcript_id, sep="/")
names(pathString) <- em$transcript_id

6. We assign the reads

ar <- assignReads(o, tree=pathString)

7. We extract ambiguously assigned features, and add them to the paths

w <- grep(ar$gene_id, "/ambiguous$")
pathString <- c(pathString, unique(paste(ar$gene_id[w],ar$transcript_id[w],sep="/")))

8. we build the tree and treeSummarizedExperiment

TO DO: Here we need to transform the pathString into a phylo object, and then at the individual sequences to it. Once we've got this, we can build the TSE, e.g.:

tse <- TreeSummarizedExperiment(assays=list(counts=m[row.names(ar),]),
                                rowData=ar, rowTree=tree)

Eventually we'll plug all of this into wrapper functions, but at the moment I think it's easier to keep them separate.

plger commented 4 years ago

@dktanwar If there's anything unclear (which I suppose there will be!), we can meet anytime this week

dktanwar commented 1 year ago

This is also completed!

shortRNAhub / shortRNA

Converting object to TreeSummarisedExperiment? #17

Sub-objectives: