hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

LINX TSV Results Files Documentation in User Guide #91

Closed DarioS closed 4 years ago

DarioS commented 4 years ago

LINX outputs a number of interesting-looking TSV files. What's the purpose of each one and how should users interpret them correctly? Could there be a section of advice for each one in the user guide? Also, I notice that one of them has the suffix linx.viral_inserts.tsv. Is there also a summary file of repeats (e.g. LINE1, SVA, Alu, etc.) output if I have previously run GRIDSS' gridss.AnnotateInsertedSequence with a RepeatMasker database (and also human_virus.fa)?

p-priestley commented 4 years ago

We are refreshing our LINX pre-publication later this month. I will make sure to include detailed documentation of each table.

We only annotate single breakends with repeats. Nearly all single breakends have repetitive context for the unmapped end. The aim of annotation is to give some context about that repetitive section - the majority are centromeric, telomeric, Poly-A (normally related to LINE insertion events) or LINE regions. This information is stored directly in the main vcf output. In a small % of cancers we can also find viral insertions - and these are annotated in that viral_inserts file if found.
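For readers who want to look at those single breakend records themselves, here is a minimal sketch of picking them out of the VCF. It relies only on the VCF specification's convention that a single breakend ALT has a `.` at one end (e.g. `G.` or `.G`); the function names are hypothetical and this is not how LINX itself does it.

```python
from typing import Iterable, Iterator, Tuple


def is_single_breakend(alt: str) -> bool:
    """True if ALT follows the VCF single-breakend notation ('G.' or '.G')."""
    if len(alt) < 2:
        return False  # a bare '.' means "no ALT", not a breakend
    if "[" in alt or "]" in alt:
        return False  # paired breakend notation
    return alt.startswith(".") or alt.endswith(".")


def single_breakends(vcf_lines: Iterable[str]) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, id, alt) for each record with a single-breakend ALT."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        if any(is_single_breakend(alt) for alt in fields[4].split(",")):
            yield fields[0], int(fields[1]), fields[2], fields[4]
```

The repeat annotation mentioned above would then be read from the INFO column of the records this yields.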

DarioS commented 4 years ago

Thanks, I'll read it when it is available. I notice LINX has had a lot of functionality added recently for LINE1. I wonder if it will also annotate Alu and SVA at a similar level of detail? I don't think that LINX automatically plots structural variants involving them, like it does if LINE1 is involved. They seem to be functionally dependent on LINE1, so they could also be biologically important.

Alu "are derived from the small cytoplasmic 7SL RNA" and "... depend on LINE retrotransposons for generation of new elements." Isofox also deals with extremely high RN7SL1/RN7SL2 expression: "We find that 6 genes in particular (RN7SL2, RN7SL1, RN7SL3, RN7SL4P, RN7SL5P and RN7SK) and are highly expressed across our cohort and at variable rates - in extreme samples these can account for more than 75% of all transcripts." So it seems that Alu elements aren't mobile by themselves but become mobile when LINE1 is active, and some patients have the Alu source gene, RN7SL, highly expressed. I think the extreme samples which Isofox finds could be interesting in terms of Alu elements and LINE1 helping them to mobilise.

SVA "is a composite non-coding retrotransposon that in all likelihood relies on the L1 ORF2 reverse transcriptase for its mobilization."

p-priestley commented 4 years ago

TSV descriptions are now available: https://github.com/hartwigmedical/hmftools/tree/master/sv-linx#outputs

Sorry for the long delay.

DarioS commented 4 years ago

LINX classifies and extensively annotates genomic rearrangements including simple and reciprocal breaks, LINE, viral and pseudogene insertions, and complex events such as chromothripsis.

Is chromothripsis summarised in one of the output tables? If not, what should I look for to identify it?

Also, is the viral summary table an analysis specific to LINX or is it merely parsing results from GRIDSS, such as BEALN? I am trying to figure out why about 20% of samples show COSMIC Signatures 2 and 13 prominently, but all of the viral_inserts.tsv files are empty.

p-priestley commented 4 years ago

The documentation is now fully updated to the latest for the full algorithm on GitHub. We label events as COMPLEX but don't specifically call chromothripsis, sorry.

Yes. LINX utilises the BEALN field to call viral insertions and will add it to viral_inserts.tsv. I am not sure why Sig 2 and 13 would be associated with a viral insertion?
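To check by hand whether any breakends in a GRIDSS VCF carry BEALN alignments at all, a minimal sketch along these lines may help. It assumes BEALN is a comma-separated INFO value with entries shaped like `contig:start|strand|cigar|mapq`, where strand and mapq may be empty; verify that against the header of your own VCF. The names `parse_info`, `parse_bealn` and `BreakendAlignment` are hypothetical, and this only reads the field: it does not reproduce how LINX decides what goes into viral_inserts.tsv.

```python
from typing import Dict, List, NamedTuple


class BreakendAlignment(NamedTuple):
    contig: str
    position: int
    strand: str
    cigar: str
    mapq: str


def parse_info(info: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key -> value dict (flags map to '')."""
    result = {}
    for item in info.split(";"):
        if item:
            key, _, value = item.partition("=")
            result[key] = value
    return result


def parse_bealn(info: str) -> List[BreakendAlignment]:
    """Return the alignments listed in BEALN, assuming the layout described above."""
    alignments = []
    for entry in parse_info(info).get("BEALN", "").split(","):
        if not entry:
            continue
        location, *rest = entry.split("|")
        contig, _, pos = location.rpartition(":")
        if not contig or not pos.isdigit():
            continue  # entry does not match the assumed layout
        rest += [""] * (3 - len(rest))  # strand, cigar, mapq may be missing
        alignments.append(BreakendAlignment(contig, int(pos), rest[0], rest[1], rest[2]))
    return alignments
```

Collecting the `contig` values across a sample and comparing them with the contig names in human_virus.fa would show whether any breakend sequence aligned to a viral reference in the first place.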