suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

blacklist for mm10 #74

Closed roryk closed 3 years ago

roryk commented 4 years ago

Hi @suhrig,

How did you generate the blacklist for hg38? We have some folks looking to call fusions in mm10 and the blacklist is super helpful for filtering out the mess from the hg38 calls.

zgtman commented 4 years ago

Hi, did you check this site: http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/

roryk commented 4 years ago

Thanks @zgtman, I think those are blacklists for ChIP/ATAC-seq. @suhrig generated a great list of common false positive fusions for hg38, and I want to generate the equivalent for mm10.

suhrig commented 4 years ago

Hi @roryk,

I generated the blacklist from several collections of benign tissue and a mixed cohort of cancer patients. The collections are listed here: https://arriba.readthedocs.io/en/latest/input-files/#blacklist

Every fusion, including discarded fusions, that was detected in at least two benign samples (from different libraries) or in at least 10% of the cancer samples (from different donors) was blacklisted. If you also use cancer samples to generate the list, make sure that the percentage threshold is safely above the frequency of the most recurrent fusion you expect in the cohort; otherwise a genuine recurrent fusion would end up on the blacklist. In total, I used well over 1000 samples.
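The rule described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that produced the official blacklist: the function name `build_blacklist`, the `(sample_id, fusion)` input tuples, and the gene-pair representation of a fusion are all hypothetical, and the real blacklist works on breakpoint-level entries plus hand-written rules.

```python
from collections import defaultdict


def build_blacklist(benign_calls, cancer_calls, n_cancer_donors,
                    min_benign_libraries=2, min_cancer_fraction=0.10):
    """Return the set of fusions that meet either recurrence threshold.

    benign_calls: iterable of (library_id, fusion) tuples (hypothetical layout)
    cancer_calls: iterable of (donor_id, fusion) tuples (hypothetical layout)
    n_cancer_donors: total number of distinct donors in the cancer cohort
    """
    # Collect the distinct libraries/donors supporting each fusion, so that
    # repeated calls from the same library or donor are counted only once.
    benign_support = defaultdict(set)
    for library_id, fusion in benign_calls:
        benign_support[fusion].add(library_id)

    cancer_support = defaultdict(set)
    for donor_id, fusion in cancer_calls:
        cancer_support[fusion].add(donor_id)

    blacklist = set()
    # Rule 1: seen in at least two benign samples from different libraries.
    for fusion, libraries in benign_support.items():
        if len(libraries) >= min_benign_libraries:
            blacklist.add(fusion)
    # Rule 2: seen in at least 10% of cancer samples from different donors.
    for fusion, donors in cancer_support.items():
        if len(donors) >= min_cancer_fraction * n_cancer_donors:
            blacklist.add(fusion)
    return blacklist
```

When cancer samples are included, `min_cancer_fraction` would need to be raised above the frequency of the most recurrent true fusion expected in the cohort, as noted above.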

In addition, there are some stringent blacklist rules for some genes that are often highly expressed and therefore prone to produce in vitro artifacts (e.g. ribosomal genes, collagens) and for genes prone to attract alignment artifacts (IG, TCR). Just unzip the list and you will see them at the top.

The key is to have genetic and tissue diversity, since different genotypes and tissues give rise to different transcript variants and artifacts.

I'm already working on a blacklist for mm10. I have a prototype based on a few hundred samples, but I wasn't satisfied with its effectiveness. I guess I need more samples and more genetic diversity (they were probably all from a single strain). I have since downloaded a few hundred more samples, but still need to process them. How urgently do you need the blacklist? If you give me a few weeks, I can send you the new list. I'm currently on vacation and the processing takes a few days, so anything earlier would be hard to accomplish.

Regards, Sebastian

roryk commented 4 years ago

Hi Sebastian,

Thanks so much for following up. This definitely isn't urgent at all. If there is anything I can do to help out, let me know. Enjoy your vacation!

roryk commented 4 years ago

Hi Sebastian,

I'm collecting the other resources needed to support mm10. I grabbed the cytoband file from here and added a header to it:

http://hgdownload.cse.ucsc.edu/goldenpath/mm10/database/cytoBand.txt.gz
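As a rough sketch of the step described above, the header could be prepended like this. The column names in `ASSUMED_HEADER` are an assumption about what Arriba expects for its cytobands file and should be checked against the Arriba documentation; the function name `add_cytoband_header` and the file paths are hypothetical.

```python
import gzip

# ASSUMPTION: these column names are not stated in this thread; verify them
# against the Arriba documentation for the cytobands file before use.
ASSUMED_HEADER = ("contig", "start", "end", "name", "giemsa")


def add_cytoband_header(in_path, out_path, header=ASSUMED_HEADER):
    """Copy a gzipped UCSC cytoBand table to a plain TSV with a header line."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.write("\t".join(header) + "\n")
        for line in src:
            # Skip blank lines and normalize the line ending.
            if line.strip():
                dst.write(line.rstrip("\n") + "\n")
```

A call might look like `add_cytoband_header("cytoBand.txt.gz", "cytobands_mm10.tsv")`, where both file names are placeholders.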

I can't seem to find a GFF3 version of the protein domains, though. How did you generate that file?

suhrig commented 4 years ago

The GFF3 file was generated using the R package PBase. I already created the files some time ago. You can find them here: mm10.zip

It could be that the draw_fusions.R script does not work with these files, however. Last time I tried, there was an error, but I have not yet found the time to dig deeper.

skchronicles commented 4 years ago

Hey @suhrig,

Thank you for creating and maintaining this awesome tool!

I am also interested in an mm10 blacklist. @kopardev and I were wondering if you had any quick updates. Please let me know if there is anything I can do to help.

Thank you again for your time. We appreciate all of your hard work.

Regards, Skyler Kuhn

suhrig commented 3 years ago

To give a quick update: The alignments of normal mouse samples are running now. This will take a couple of days. Next, I can compile the blacklist, which will also take me a few days. I could not start earlier with alignment, because I first wanted to tweak the alignment parameters for better detection of internal tandem duplications. This was more effort than expected, but totally worth it, because ITD detection is much improved now. Will keep you posted.

skchronicles commented 3 years ago

@suhrig Sounds good, thank you for the update!

suhrig commented 3 years ago

The blacklist for mm10 is ready. You can download it from here until the official release is out:

https://c.1und1.de/@854294030366802154/sN08tO_8SWGyWmtx8rdgyw

I ran some tests and sanity checks to confirm the blacklist works and is not too lenient/strict. But I don't have as much benchmarking data for mouse as I have for human. So if you notice that important fusions are removed by the blacklist, please let me know. Maybe there is a way to make it less strict.

If you use the latest Arriba release (v1.2.0), you will get tons of warnings about unknown genes/malformed ranges. This is because the blacklist contains items for viral genomes (to detect viral integration sites). Use the latest develop version of Arriba to get rid of these warnings.

skchronicles commented 3 years ago

@suhrig Thank you, this is great news! I will test it out, and I will let you know if I run into any problems.

suhrig commented 3 years ago

Dear @skchronicles and @roryk,

Release 2.0.0 is out, including the blacklist for mm10. In the end, I had to make some more enhancements to the blacklist, because it was not particularly useful for RefSeq. So the blacklist I sent you earlier is only a subset of the new, official blacklist. I will disable the download of the old one to avoid confusion.

Regards, Sebastian

DarioS commented 3 years ago

mm39?

This means a change in chromosome coordinates, but it also means that 370 issues with the assembly have been resolved.

suhrig commented 3 years ago

I hadn't even noticed mm39 was out. :-D Thanks for the hint. I will put it on my ToDo list.