thelovelab / tximeta

Transcript quantification import with automatic metadata detection
https://thelovelab.github.io/tximeta/

Add info about duplicates to output SE #26

Closed csoneson closed 4 years ago

csoneson commented 4 years ago

Just thinking: would it make sense to include information about duplicate transcripts in the output SummarizedExperiment object from tximeta? I'm thinking of a situation, for example, where data has been quantified based on the same underlying transcriptome, but for one reason or another, different choices have been made regarding which among a set of duplicate transcripts to keep. In this case, it would be difficult to directly compare the quantifications since the sets of genes/transcripts in the output objects are different. Having the information about duplicate transcripts directly in the SE would simplify this.

For a transcript-level SE, perhaps one could add two columns to the rowData (something like hasDuplicate and duplicates, where the latter would be a list with the IDs of the identical transcripts).
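A minimal sketch of what that transcript-level rowData might look like (column names as proposed above; the transcript IDs are hypothetical):

```r
suppressPackageStartupMessages({
  library(S4Vectors)  # DataFrame
  library(IRanges)    # CharacterList
})

## hypothetical transcript-level rowData with the two proposed columns
rd <- DataFrame(
  tx_id = c("ENST0001", "ENST0002", "ENST0003"),
  hasDuplicate = c(TRUE, FALSE, TRUE),
  duplicates = CharacterList(list("ENST0003", character(0), "ENST0001"))
)
rd
```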

On the gene level, I'm not sure what would be the best approach; perhaps indicate how many of the transcripts are duplicates of others, as well as the gene/transcript IDs of these. It's a bit more complicated here also since a gene can be retained even if some of its transcripts have been excluded as duplicates.

mikelove commented 4 years ago

This seems very useful.

One note is that the duplicate txp info is contained in the index which is not (assumed to be) accessible via tximeta. So I could add this as an option, and that would either require the user to specify the duplicates file, or it would trigger the downloading of the FASTA as is now being done in cleanDuplicateTxps. Just wanted to make it clear that it's a slow process if the duplicates file is not on hand. Thoughts?
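For context, the duplicate info in a salmon index is the duplicate_clusters.tsv file written when duplicates are collapsed; if the user can point tximeta at the index directory, reading it is cheap (the path here is hypothetical, and the exact column names may vary across salmon versions):

```r
## each row maps a removed duplicate transcript to its retained representative
dups <- read.delim("/path/to/salmon_index/duplicate_clusters.tsv",
                   header = TRUE, stringsAsFactors = FALSE)
head(dups)
```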

Yes, agree it would be straightforward at the txp level following your spec above. For the gene level it's complicated. I think that showing in the rowData the number of txps from the GTF that were marked as duplicate, and then providing their IDs, is useful; further processing is then possible by the user making use of the TxDb.

csoneson commented 4 years ago

So I could add this as an option, and that would either require the user to specify the duplicates file, or it would trigger the downloading of the FASTA as is now being done in cleanDuplicateTxps

Yes, I think this makes sense! I can't really think of a general way around this. Btw, I noticed that while the SeqHash is the same regardless of whether or not duplicates are filtered out, the NameHash changes - if there's relevant information in there about the duplicate removal perhaps that could be cached too?

mikelove commented 4 years ago

@rob-p that's a good point that Charlotte has. So I'm in favor of SeqHash being the full file obtained from the source (e.g. Gencode) for tximeta purposes, but right now meta_info.json doesn't indicate whether keepDuplicates was used or not. Thoughts?

rob-p commented 4 years ago

@mikelove & @csoneson,

These are great points. What would be the best way to handle this? Should we record the status of the --keepDuplicates flag and propagate it to the meta_info.json for each quant? I have no strong preference about whether the SeqHash should differ when discarding duplicates, but I suppose having it not differ would be most consistent with what a sequence hash represents. One thing to note is that the specific duplicate collapsing files are recorded in the index directory (not the quant dir), and I don't think propagating that file to every quant directory is ideal. Do we need the specific collapse structure? Any ideas how we should handle that?

mikelove commented 4 years ago

I think propagating keepDuplicates to meta_info.json works.

What we were discussing above is that tximeta can work out the duplicates if either the user specifies the index directory, or I can infer them by downloading the FASTA (slow, last resort).

rob-p commented 4 years ago

@csoneson & @mikelove,

Commit https://github.com/COMBINE-lab/salmon/commit/09c50d3675e193c6b57137b426c6170717a7baba, along with the corresponding commit to pufferfish addresses this. It propagates a keep_duplicates entry to meta_info.json. The behavior is as follows. If salmon was run in alignment mode, then this cannot be known and the keep_duplicates key is not contained in meta_info.json (this is also the behavior if one runs with an old index that doesn't propagate this info). Otherwise, if running in selective-alignment mode with the new version, the keep_duplicates field is a boolean field that is true if --keepDuplicates was passed to the salmon index command and false otherwise.
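On the consuming side, checking for this field could look something like the following sketch (the quant path is hypothetical; per the description above, the key is simply absent in alignment mode or with older indices):

```r
library(jsonlite)

## meta_info.json lives under aux_info/ in a salmon quant directory
info <- fromJSON(file.path("quant_dir", "aux_info", "meta_info.json"))
if (is.null(info$keep_duplicates)) {
  message("keepDuplicates status unknown (alignment mode or old index)")
} else {
  message("index was built with --keepDuplicates: ", info$keep_duplicates)
}
```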

mikelove commented 4 years ago

Thanks Rob!

Meanwhile, I'm working on the tximeta functionality. I already had some code in there that would stash a table with duplicate info in the BiocFileCache. It's not exactly what we need, so I'm going to repurpose that code and that table to give us the info we need; by using BFC it only needs to access the FASTA once. I'll do some speed testing, but I may go with just creating this table from the FASTA no matter what, so the behavior is consistent whether or not the index is available.

mikelove commented 4 years ago

Started working on markDuplicateTxps; version 1.5.18 can be tested: c55856f

I ended up breaking cleanDuplicateTxps and will work on that tomorrow.

mikelove commented 4 years ago

v1.5.19 should fix both mark and clean duplicates, and these can be used in combination (dup marking happens after dup cleaning is attempted): d28131a

I tested with Ensembl 99 human a little bit but not really exhaustively. I should make sure there's not a regression on the clean dup code wrt tximeta 1.4.

mikelove commented 4 years ago

Ok I see no regressions wrt v1.4, so I'm gonna close but feel free to re-open if you see an issue.

mikelove commented 4 years ago

Oh and one more detail: at the gene level you will get numDupSets, which is the number of transcript duplicate sets. I didn't see a nice way to package up the actual duplicates at the gene level (you'd want to know the duplicate sets, so it would be a list of lists of characters along the genes). But then you can refer back to mcols(se) to find out the actual IDs.
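Referring back to the transcript level could be sketched like this (assuming a transcript-level SE `se` with a `gene_id` rowData column and the marking columns discussed above; all names hypothetical):

```r
## collect, per gene, the IDs of transcripts flagged as duplicates
rd <- mcols(se)
dupTxpsByGene <- split(rownames(se)[rd$hasDuplicate],
                       rd$gene_id[rd$hasDuplicate])
```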

Working on this feature made me think of this: https://github.com/mikelove/tximeta/issues/27

csoneson commented 4 years ago

Nice! Wondering if

mcols(txps)$hasDuplicate <- FALSE
mcols(txps)$duplicates <- CharacterList(as.list(rep("",length(txps))))

(in https://github.com/mikelove/tximeta/blob/d28131a60d347a2be61c5a6f15f65390e9dc8cdb/R/tximeta.R#L401-L403) should be outside the check for if (length(duplicates) > 0). It could be confusing if the user asks for duplicate information and nothing is returned in the output.
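That is, a sketch of the suggested ordering (simplified from the linked code; `txps` and `duplicates` as in that snippet):

```r
## initialize the columns unconditionally when marking is requested...
mcols(txps)$hasDuplicate <- FALSE
mcols(txps)$duplicates <- CharacterList(as.list(rep("", length(txps))))
## ...and only fill them in when duplicates were actually found
if (length(duplicates) > 0) {
  ## mark duplicate transcripts here
}
```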

csoneson commented 4 years ago

I also think that https://github.com/mikelove/tximeta/blob/d28131a60d347a2be61c5a6f15f65390e9dc8cdb/R/duplicateTxps.R#L23 should rather split at the first " "; the names returned with e.g. the example data used in the first vignette example (Ensembl annotation) are of the form "FBtr0070129 cdna chromosome:BDGP6"
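One way to do that split (a sketch, not the exact code in duplicateTxps.R):

```r
nm <- "FBtr0070129 cdna chromosome:BDGP6"
sub(" .*", "", nm)  # keep everything before the first space
# [1] "FBtr0070129"
```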

mikelove commented 4 years ago

Thanks for catching these, I think I've addressed them in 6a6c33f

In particular, I now check whether the Ensembl organism is human/mouse or not. The first two require splitting on "." to match with the GTF, but for the others I think splitting on " " is needed, as you point out. I also resolved the other comment, and I now write out a message on the number of duplicate sets found during marking.