Closed dktanwar closed 1 year ago
[x] 1) function to construct a features tree.
prepareAnnotation
)dendrogram
or phylo
object
Start with miRNA and tRNA:-*
on the right of the name, e.g. tRNA-Val-CAC-1-1
belongs to tRNA-Val-CAC-1
, which is a child of tRNA-Val-CAC
, which is a child of tRNA-Val
and finally tRNA
[x] 2) function to add fragments to an existing tree
phylo
, since this is what TSE uses)[ ] 3) once we have 1) and 2), we'll be able to modify the assignRead
function with the following input/output:
Hi @plger, I do not understand the second part much.
Could you please provide an example of that table?
E.g. table (I'm making up the names here) :Read1 tRNA-Ala-XYZ-1Read1 tRNA-Ala-XYZ-2Read1 tRNA-Ala-ABC-1Read2 tRNA-Sel-GCG-1Read3 ...Etc.So you'd want to place Read1 under tRNA-Ala in the tree. Is that clearer?Pierre-LucOn Jul 9, 2019 9:57 AM, Deepak Tanwar notifications@github.com wrote:Hi @plger, I do not understand the second part much. Could you please provide an example of that table?
—You are receiving this because you were mentioned.Reply to this email directly, view it on GitHub, or mute the thread.
If we have a node with the following values/ leaves/fragments
tRNA-Tyr-GTA-3
¦--tRNA-Tyr-GTA-3-1
°--tRNA-Tyr-GTA-3-2
The table should be:
Node | Fragment |
---|---|
tRNA-Tyr-GTA-3 | tRNA-Tyr-GTA-3-3 |
tRNA-Tyr-GTA-3 | tRNA-Tyr-GTA-3-4 |
And, the output should be:
tRNA-Tyr-GTA-3
¦--tRNA-Tyr-GTA-3-1
¦--tRNA-Tyr-GTA-3-2
¦--tRNA-Tyr-GTA-3-3
°--tRNA-Tyr-GTA-3-4
either I didn't get your example, or I haven't explained well enough (sorry that my table got messed up, was on the phone)...
So if your tree is:
tRNA-Tyr-GTA-3
¦--tRNA-Tyr-GTA-3-1
°--tRNA-Tyr-GTA-3-2
And you get a table that tells you what feature each read aligns to, e.g.:
Fragment | matchingLeaf |
---|---|
Read1 | tRNA-Tyr-GTA-3-1 |
Read1 | tRNA-Tyr-GTA-3-2 |
Read2 | tRNA-Tyr-GTA-3-1 |
Then the output should be:
tRNA-Tyr-GTA-3
¦--tRNA-Tyr-GTA-3-1
°-- Read2
¦--tRNA-Tyr-GTA-3-2
°--Read1
Any clearer?
Yes!
So, do you want to have the Reads as well in the tree along with tRNAs
?
If yes, then does it make sense to put them as leaf of that tRNA
?
yes, that's the point.
ok. I will write the script and create an example
Done with 1 and 2. Putting the whole thing together...
Script: https://github.com/mansuylab/shortRNA/blob/deepak/R/tree.R
You can follow examples of each function to know how it is working. I have also pushed the test data: https://github.com/mansuylab/shortRNA/tree/deepak/test
For point 3
I think we would have to discuss these?
Script is too big: https://github.com/mansuylab/shortRNA/blob/ae8017fe50ec24e5545e393c5d7b6d0fb43da19f/R/assignment.R
Let me know if everything is OK so far and then how to proceed from here.
So, first thanks a lot!
Regarding 1:
data.tree
isn't particularly fast, so I'd say let's keep it minimal!), and one could simplify (I'll email you a version).gene_biotype/gene/transcript_id
Regarding 2:
testDF <- data.frame(
Reads=c("R1","R1","R2","R3"),
Features=c("tRNA-Ala-AGC-11-1", "tRNA-Ala-AGC-12-1", "tRNA-Ala-AGC-11-1", "non-existing-feature")
)
addReadsFeatures(tree, testDF)
gives this:
2 ¦--tRNA-Ala-AGC
3 ¦ ¦--tRNA-Ala-AGC-1
4 ¦ ¦ °--tRNA-Ala-AGC-1-1
5 ¦ ¦--tRNA-Ala-AGC-10
6 ¦ ¦ °--tRNA-Ala-AGC-10-1
7 ¦ ¦--tRNA-Ala-AGC-11
8 ¦ ¦ °--tRNA-Ala-AGC-11-1
9 ¦ ¦ ¦--R1
10 ¦ ¦ °--R2
11 ¦ ¦--tRNA-Ala-AGC-12
12 ¦ ¦ °--tRNA-Ala-AGC-12-1
13 ¦ ¦ °--R1
when what we'd want is this:
¦--tRNA-Ala-AGC
¦ ¦--R1
¦ ¦--tRNA-Ala-AGC-1
¦ ¦ °--tRNA-Ala-AGC-1-1
¦ ¦--tRNA-Ala-AGC-10
¦ ¦ °--tRNA-Ala-AGC-10-1
¦ ¦--tRNA-Ala-AGC-11
¦ ¦ °--tRNA-Ala-AGC-11-1
¦ ¦ °--R2
¦ ¦--tRNA-Ala-AGC-12
¦ ¦ °--tRNA-Ala-AGC-12-1
Hi @plger
You mentioned: we'd need to split "let-7a" into "let-7/let-7a"
This is what you mean by that?
miRNA
¦--let-7
¦ ¦--let-7-7g
¦ ¦ ¦--let-7-7g-5p
¦ ¦ ¦ ¦--let-7-7g-5p-A
¦ ¦ ¦ ¦--let-7-7g-5p-C
¦ ¦ ¦ ¦--let-7-7g-5p-G
¦ ¦ ¦ °--let-7-7g-5p-T
¦ ¦ °--let-7-7g-3p
¦ ¦ ¦--let-7-7g-3p-A
¦ ¦ ¦--let-7-7g-3p-C
¦ ¦ ¦--let-7-7g-3p-G
¦ ¦ °--let-7-7g-3p-T
Yes, though this would be more elegant (because consistent with the official nomenclature) :
miRNA
¦--let-7
¦ ¦--let-7g
¦ ¦ ¦--let-7g-5p
¦ ¦ ¦ ¦--let-7g-5p-A
...
Yes, though this would be more elegant (because consistent with the official nomenclature) :
miRNA ¦--let-7 ¦ ¦--let-7g ¦ ¦ ¦--let-7g-5p ¦ ¦ ¦ ¦--let-7g-5p-A ...
Yes, this is what I was thinking...
Done!
miRNA
¦--let
¦ °--let-7
¦ ¦--let-7g
¦ ¦ ¦--let-7g-5p
¦ ¦ ¦ ¦--let-7g-5p-A
¦ ¦ ¦ ¦--let-7g-5p-C
¦ ¦ ¦ ¦--let-7g-5p-G
¦ ¦ ¦ °--let-7g-5p-T
¦ ¦ °--let-7g-3p
¦ ¦ ¦--let-7g-3p-A
¦ ¦ ¦--let-7g-3p-C
¦ ¦ ¦--let-7g-3p-G
¦ ¦ °--let-7g-3p-T
For: We'll also do the same (build trees) with gencode features, in the format: gene_biotype/gene/transcript_id
So, I think we have to do this from the gff
file.
Its should be: gene_type/gene_name/transcript_name
?
Or, should we use ENSEMBL IDs instead?
seqid source type start end score
1 chr1 HAVANA gene 3073253 3074322 NA
2 chr1 HAVANA transcript 3073253 3074322 NA
3 chr1 HAVANA exon 3073253 3074322 NA
4 chr1 ENSEMBL gene 3102016 3102125 NA
5 chr1 ENSEMBL transcript 3102016 3102125 NA
strand phase ID
1 + NA ENSMUSG00000102693.1
2 + NA ENSMUST00000193812.1
3 + NA exon:ENSMUST00000193812.1:1
4 + NA ENSMUSG00000064842.1
5 + NA ENSMUST00000082908.1
gene_id gene_type gene_name level
1 ENSMUSG00000102693.1 TEC 4933401J01Rik 2
2 ENSMUSG00000102693.1 TEC 4933401J01Rik 2
3 ENSMUSG00000102693.1 TEC 4933401J01Rik 2
4 ENSMUSG00000064842.1 snRNA Gm26206 3
5 ENSMUSG00000064842.1 snRNA Gm26206 3
mgi_id havana_gene Parent
1 MGI:1918292 OTTMUSG00000049935.1 NA
2 MGI:1918292 OTTMUSG00000049935.1 ENSMUSG00000102693.1
3 MGI:1918292 OTTMUSG00000049935.1 ENSMUST00000193812.1
4 MGI:5455983 NA NA
5 MGI:5455983 NA ENSMUSG00000064842.1
transcript_id transcript_type transcript_name
1 NA NA NA
2 ENSMUST00000193812.1 TEC 4933401J01Rik-201
3 ENSMUST00000193812.1 TEC 4933401J01Rik-201
4 NA NA NA
5 ENSMUST00000082908.1 snRNA Gm26206-201
transcript_support_level tag havana_transcript
1 NA NA NA
2 NA basic OTTMUST00000127109.1
3 NA basic OTTMUST00000127109.1
4 NA NA NA
5 NA basic NA
exon_number exon_id protein_id ccdsid
1 NA NA NA NA
2 NA NA NA NA
3 1 ENSMUSE00001343744.1 NA NA
4 NA NA NA NA
5 NA NA NA NA
ont
1 NA
2 NA
3 NA
4 NA
5 NA
let's do it this way:
create a stable gene id from gene_id
(so that ENSMUSG00000102693.1
becomes ENSMUSG00000102693
), and then we do:
longRNA/gene_type/stable_gene_id:gene_name/transcript_id
This way we have everything (transcript_name isn't very meaningful anyway)
This looks good! Can you give me an idea of how long it takes to build the longRNA tree, and how long it takes to assign a large number of reads in the tree? Beside changing the package's other functions, I think we'll have to cleanup the tree a bit, for a few things:
Hi @plger
Well, for longRNAs, it does take a few hours I think. I left it running. We can provide those as the data
itself from the package.
Because of the computation time, I was thinking of adapting the code to run in parallel. That would give different trees for different "biotype", and then we can add them short/long
For reads assignment. I haven't tested it. But, we can do it in parallel as well (separately for longRNA, miRNA, tRNA) and then merge them. Although, the question would be if a read could possibly assign to, for example, miRNA and tRNA both.
For "mature miRNAs", and "miRNA precursors", how do you want the coordinates to be added? An example, please.
Yes, I will add data from all the papers that I have.
All databases are downloaded here: /mnt/IM/projects/software/shortRNA/test/Mus_musculus
I will make the trees!
Hi @plger , so, it might mot make sense to use tree for piRNAs, rRNAs (7 in total), circRNa as these are just numbered (except rRNA).
If we are going to have just 1 tree from mm10, these could be easily converted to a tree.
Can we discuss on Monday?
Hi @plger
Made it easier for you to review code:
Hi,
I've just pushed the long-awaited working version of the assignment (in branch plg
), and I'm happy to say that it has been considerably sped up. You should now be able to start working on the rest.
So, right now the procedure is as follows:
1. Extract reads and build count matrix (here we assume that the reads have already been trimmed)
suppressPackageStartupMessages({
library(Biostrings)
})
devtools::load_all("shortRNA/")
m <- fastq2SeqCountMatrix(list.files(pattern="fastq"))
fa <- DNAStringSet(row.names(m))
names(fa) <- paste0("S",1:length(fa))
writeXStringSet(fa, "unique.fasta")
2. Build the annotation This obviously we do only once, and we build the custom genome from this:
filelist <- getAnnotationFiles("hg38", destination="annotation")
a <- prepareAnnotation(filelist = filelist, destination = "annotation",includeIsomirs = FALSE)
saveRDS(a, file="annotation/anno.rds")
3. Align the reads However you like, so long there's a bam file at the end...
shortRNAexp_align("unique.fasta", "aligned.bam", m=200, nthreads=8,
bowtie1index="/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/shortRNA/bowtie1/custom",
starindex = "/reference/Homo_sapiens/Ensembl/GRCh38.p13/Annotation/Release_98-2019-09-3/genes_STARIndex",
bowtie1 = "/conda/bin/bowtie", star = "/conda/bin/STAR", samtools = "/conda/bin/samtools")
4. get the overlaps between alignments and annotation
a <- readRDS("annotation/anno.rds")
o <- overlapWithTx2("aligned.bam", a, nbthreads = 8)
# we check which feature types are valid with the overlaping reads
o <- getOverlapValidity(o)
5. We build the tree paths only for the features that have some overlaps At this stage we work with the pathStrings, since this is much faster to manipulate than the trees. Moreover, doing this only for the detected features will speed up things:
em <- a@elementMetadata
uids <- unique(o$transcript_id)
em <- em[which(em$transcript_id %in% uids),]
# eventually replace this with your function
pathString <- paste(em$transcript_type, em$gene_id, em$transcript_id, sep="/")
names(pathString) <- em$transcript_id
6. We assign the reads
ar <- assignReads(o, tree=pathString)
7. We extract ambiguously assigned features, and add them to the paths
w <- grep(ar$gene_id, "/ambiguous$")
pathString <- c(pathString, unique(paste(ar$gene_id[w],ar$transcript_id[w],sep="/")))
8. we build the tree and treeSummarizedExperiment
TO DO: Here we need to transform the pathString
into a phylo
object, and then at the individual sequences to it.
Once we've got this, we can build the TSE, e.g.:
tse <- TreeSummarizedExperiment(assays=list(counts=m[row.names(ar),]),
rowData=ar, rowTree=tree)
Eventually we'll plug all of this into wrapper functions, but at the moment I think it's easier to keep them separate.
@dktanwar If there's anything unclear (which I suppose there will be!), we can meet anytime this week
This is also completed!
I think it would be possible to adapt to
TreeSummarisedExperiment
.Things checked:
TSE
object.TSE
object.