pmelsted / pizzly

Fast fusion detection using kallisto

BSD 2-Clause "Simplified" License

80 stars 10 forks source link

segmentation fault #8

Closed kreil closed 7 years ago

kreil commented 7 years ago

We successfully compiled the latest version which correctly displays the help text when run. When we try and run it on our data, however, it just dies with no comments and a segmentation fault.

How can we help debug this issue?

From what I can see it already dies in unordered_map / _Hashtable when reading in the gene transcript models from the GTF from disk.

This seems to be specific to the GTF file we use (based on the AceView annotation); because pizzly does not crash this way when we use a Gencode GTF.

Any suggestions for figuring out what is tripping it up?

Many thanks, David

gdb backtrace output below ...

Reading symbols from /bi/arch/bin/pizzly-g3gdb...done. [New LWP 17715] [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `pizzly-g3gdb -k 31 --gtf human_AceView.2010_ERCC_exon.gtf.gz --cache index.cach'. Program terminated with signal SIGSEGV, Segmentation fault. (gdb) bt

0 0x000000000060451e in std::_Hashtable<std::string, std::pair<std::string const, TranscriptModel>, std::allocator<std::pair<std::string const, TranscriptModel> >, std::detail::_Select1st, std::equal_to, std::hash, std::detail::_Mod_range_hashing, std::detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::detail::_Hashtable_traits<true, false, true> >::_M_bucket_index (this=0x20, __k="2-oxoacid_dh.aAug10-unspliced", __c=16340857678545105613) at /usr/include/c++/4.9/bits/hashtable.h:614

1 0x00000000005f4a88 in std::_Hashtable<std::string, std::pair<std::string const, TranscriptModel>, std::allocator<std::pair<std::string const, TranscriptModel> >, std::detail::_Select1st, std::equal_to, std::hash, std::detail::_Mod_range_hashing, std::detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::detail::_Hashtable_traits<true, false, true> >::find (this=0x20, __k="2-oxoacid_dh.aAug10-unspliced") at /usr/include/c++/4.9/bits/hashtable.h:1303

2 0x00000000005e8941 in std::unordered_map<std::string, TranscriptModel, std::hash, std::equal_to, std::allocator<std::pair<std::string const, TranscriptModel> > >::find (this=0x20, __x="2-oxoacid_dh.aAug10-unspliced") at /usr/include/c++/4.9/bits/unordered_map.h:574

3 0x00000000005c5adc in parseGTF (transcriptome=..., gtf_fn="human_AceView.2010_ERCC_exon.gtf.gz") at /bi/home/bi-sw/src/pizzly/GeneModel.hpp:364

4 0x00000000005cfce8 in main (argc=16, argv=0x7ffd9d5adf28) at /bi/home/bi-sw/src/pizzly/main.cpp:92

kreil commented 7 years ago

From looking through parseGTF, it seems that while the parser correctly reads the transcript and gene IDs from the "exon" lines, it does not insert a new transcript or gene in the transcriptome model but only looks for an existing record to which to add. The "gene" and the "transcript" lines, however, are redundant in a GTF, so the parser should not rely on their existence. There seem to be "assert" lines that catch this situation but in the released software they do not get triggered.

When we generate and add gene and transcript lines, the file is beginning to be processed, so these lines missing evidently triggered the segmentation fault. This is not ideal. Ideally, the redundant gene and transcript lines should just be inferred and auto-inserted into the in-memory model. At the least, there should be an error message.

After adding the gene and transcript lines, we ran into another issue where we noticed that the parser seems to be a little unforgiving of trailing white-space, complaining that it cannot convert "1234 " into an integer (no quotes in the file, btw., the space is directly followed by a tab). At least here we get an error message but it would also be nicer if the parser could simply ignore trailing spaces, please.

I hope these notes are helpful during code revision!

ndaniel commented 7 years ago

Pizzly, according to its authors and the manual, works only and only with GTF files from Ensembl! See: https://github.com/pmelsted/pizzly/issues/7

pmelsted commented 7 years ago

Hi @kreil thanks for reporting this and the excellent notes. I'll have to put some more effort into the various GTF formats, for now we've been focusing on Ensembl, since this is our preferred option, but we want for it to work any annotation.

Is this the GTF file you are working with, ftp://ftp.ncbi.nih.gov/repository/acedb/ncbi_37_Aug10.human.genes/AceView.ncbi_37.genes_gff.gff.gz ? If it is not is it possible for you to share the annotation so that I can make sure the code handles the format correctly.

Thanks, Pall

kreil commented 7 years ago

Dear Pall,

Thank you for your kind note! We have just generated the missing gene and transcript lines and removed all trailing white space in numeric fields, and the GTF file is now read in nicely. I would just suggest instead of the assert lines in parseGTF to terminate with an error showing the offending line in the GTF and clarifying that a gene or transcript line is missing for that.

With the fixed GTF file, however, we still find no fusions despite expecting some (cancer cell lines), so we're still exploring the situation! If you have any suggestions at all, I would of course be grateful. I note that the kallisto output fusion.txt is about 6 million lines, with both split and pair entries. Is it likely that despite all these there is not a single fusion?

We're currently running this command line

pizzly-g3gdb -k 31 --gtf human_AceView.2010_ERCC.gtf.gz --cache index.cache.txt --align-score 2 --insert-size 330 --fasta human.AceView.2010.selected.fasta.gz --output NVS_D_4/fusion.out NVS_D_4/fusion.txt

and you can see all files in http://ala.boku.ac.at/kreil/scratch/SEQC_kallisto_fusion/

I get

Opening cached file ... loaded 26265 genes and 0 transcripts Read a total of 55828 transcripts Number of kept records 0 out of 692687

Does that look alright? E.g., is it expected that only 0 transcripts are read from the cache? We've also tried with a larger insert-size of 9000, no different result though.

Kallisto was run as

kallisto quant --bias --fusion -i ../SEQC_kallisto/ref/transcripts.idx -o $OUT_PATH -t 16

Most grateful for any comments...

With best regards, David.

On 22 May 2017 at 17:41, Pall Melsted notifications@github.com wrote:

Hi @kreil https://github.com/kreil thanks for reporting this and the excellent notes. I'll have to put some more effort into the various GTF formats, for now we've been focusing on Ensembl, since this is our preferred option, but we want for it to work any annotation.

Is this the GTF file you are working with, ftp://ftp.ncbi.nih.gov/ repository/acedb/ncbi_37_Aug10.human.genes/AceView. ncbi_37.genes_gff.gff.gz ? If it is not is it possible for you to share the annotation so that I can make sure the code handles the format correctly.

Thanks, Pall

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303137897, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7CyVGeCQ-8ir07ngungTgWDNe6cgWks5r8ayOgaJpZM4NhtuE .

ndaniel commented 7 years ago

@kreil

Opening cached file ... loaded 26265 genes and 0 transcripts Read a total of 55828 transcripts Number of kept records 0 out of 692687

This is exactly https://github.com/pmelsted/pizzly/issues/7 Using anything else than Ensembl GTF will give no fusion genes even that Pizzly will run just fine.

kreil commented 7 years ago

Dear Daniel, Dear Pall,

Unfortunately, the Ensemble gene models are not that useful for us.

AceView covers 50% more exonic regions and 60% more CDS regions than either EnsEMBL or Gencode do.

But that aside - do you have any idea what is causing this problem? We would love to help fix it.

If we knew what pizzly actually expects, we could ensure that we reformat "our" (or any) GTF to meet its requirements. So to say, "make it look like" an EnsEMBL GTF.

So adding gene and transcript lines - done. Removing trailing spaces in numeric fields - done.

What might be the next problem or requirement, please?

We would love to help debug this, and we can also provide a "GTF sanitizer" for others to use to fix all the issues we discovered once we're through the effort.

Many thanks, David

On 22 May 2017 at 19:29, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Opening cached file ... loaded 26265 genes and 0 transcripts Read a total of 55828 transcripts Number of kept records 0 out of 692687

This is exactly #7 https://github.com/pmelsted/pizzly/issues/7 Using anything else the Ensembl GTF will give no fusion genes even that Pizzly will run just fine.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303166958, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7Cw4Rk1SYofB2EnmIierqekPHSm-Uks5r8cXqgaJpZM4NhtuE .

kreil commented 7 years ago

Dear Daniel,

Where can we first see things go wrong?

Is this line still ok

Opening cached file ... loaded 26265 genes and 0 transcripts

or does that already point to the problem? (I mean the "0 transcripts".) Can we already see things go wrong in the index cache?

If it's really just a question of massaging the GTF text input to be parsed properly by parseGTF, I think we can realistically help debug this step.

If it's something hidden more deeply in the bowels of the bear, we'd need a pointer / some help to contribute meaningfully.

Best regards, David.

On 22 May 2017 at 19:29, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Opening cached file ... loaded 26265 genes and 0 transcripts Read a total of 55828 transcripts Number of kept records 0 out of 692687

This is exactly #7 https://github.com/pmelsted/pizzly/issues/7 Using anything else the Ensembl GTF will give no fusion genes even that Pizzly will run just fine.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303166958, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7Cw4Rk1SYofB2EnmIierqekPHSm-Uks5r8cXqgaJpZM4NhtuE .

ndaniel commented 7 years ago

Dear David!

Unfortunately, the Ensemble gene models are not that useful for us.

That's ok!

AceView covers 50% more exonic regions and 60% more CDS regions than either EnsEMBL or Gencode do.

Covering more exonic regions and more CDS regions does not mean that a fusion caller will find better/more fusion genes. Actually, it matters more the quality of the gene annotation than the quantity of the gene annotation. For example, any fusion gene caller will have big challenges with genes which overlap in a given gene annotation because the reads mapping there will look like multi-mapping reads. There are many fusion genes which have the fusion junction in mid-introns or even have a random sequence inserted at the fusion junction and in this case even the largest exonic annotation will not help.

A good fusion caller will be able to find a fusion gene even if an exonic region is not annotated or is annotated incorrectly.

ndaniel commented 7 years ago

@kreil

Is this line still ok Opening cached file ... loaded 26265 genes and 0 transcripts

In my opinion 0 transcripts shows that something is not right.

If it's really just a question of massaging the GTF text input to be parsed properly by parseGTF, I think we can realistically help debug this step.

I guess that the only choice is modify/make/force (manually?) the AceView GTF to look exactly like a Ensembl GTF file until pizzly will support officially AceView GTF files.

kreil commented 7 years ago

Dear Daniel,

Thank you for the fast response!

On 22 May 2017 at 19:54, Daniel Nicorici notifications@github.com wrote:

Dear David!

Unfortunately, the Ensemble gene models are not that useful for us.

That's ok!

AceView covers 50% more exonic regions and 60% more CDS regions than either EnsEMBL or Gencode do.

Covering more exonic regions and more CDS regions does not mean that a fusion caller will find better/more fusion genes. Actually, it matters more the quality of the gene annotation than the quantity of the gene annotation.

That's right!

Apologies for being unclear before. We know from very large scale work that the AceView annotation is not just more comprehensive but also is more specific than Gencode and Ensembl, i.e., has more support from actually observed RNA-seq reads, including junction-specific reads. This is still correct today, even though it has not been updates since 2010 (which makes working with it an operational pain in the neck at times).

I did not want to start a discussion about the merits of one or the other annotation, however, I am just trying to find out what pizzly needs to correctly read and use a GTF file.

Do you have any idea where the problem first shows? Is it already at the parse / index cache creation stage, or just further down the road?

Best wishes, David

kreil commented 7 years ago

Ah, good, that is very valuable to know. So both genes and transcripts should be "remembered" in the cache, right? You have a non-zero number there when you work with EnsEMBL GTFs, correct?

Then already the reading/caching of the other GTF doesn't work (rather than some later processing).

So we can use that to see what's wrong and try and fix it in the input file.

Many thanks, David

On 22 May 2017 at 19:58, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Is this line still ok Opening cached file ... loaded 26265 genes and 0 transcripts

In my opinion *0 transcripts shows that something is not right.

If it's really just a question of massaging the GTF text input to be parsed properly by parseGTF, I think we can realistically help debug this step.

I guess that the only choice is make/force (manually?) the AceView GTF to look exactly like a Ensembl GTF file until pizzly will support officially AceView GTF files.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303174548, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7C1iojHSFlxw7AXz2GTxnSDfbVHfHks5r8czKgaJpZM4NhtuE .

ndaniel commented 7 years ago

@kreil

Ah, good, that is very valuable to know. So both genes and transcripts should be "remembered" in the cache, right? You have a non-zero number there when you work with EnsEMBL GTFs, correct?

Yes, for Ensembl GTF it is a non-zero number.

kreil commented 7 years ago

PS: Could you perhaps send me some lines from a valid index cache file that includes transcripts? Should there be gene and transcript and exon lines? We only get gene lines so far, which look as follows:

GENE CCDC38 chr12 PROTEIN - 96260341 96336757 CCDC38.cAug10;CCDC38.fAug10;CCDC38.gAug10;CCDC38.eAug10;CCDC38.aAug10;CCDC38.dAu g10;CCDC38.bAug10

so "GENE", gene name, chromosome name, gene type, strand, start, end, list of transcript names separated by ";"

Many thanks, David

On 22 May 2017 at 20:00, Dr. D. P. Kreil (Boku) David.Kreil@boku.ac.at wrote:

Ah, good, that is very valuable to know. So both genes and transcripts should be "remembered" in the cache, right? You have a non-zero number there when you work with EnsEMBL GTFs, correct?

Then already the reading/caching of the other GTF doesn't work (rather than some later processing).

So we can use that to see what's wrong and try and fix it in the input file.

Many thanks, David

On 22 May 2017 at 19:58, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Is this line still ok Opening cached file ... loaded 26265 genes and 0 transcripts

In my opinion *0 transcripts shows that something is not right.

If it's really just a question of massaging the GTF text input to be parsed properly by parseGTF, I think we can realistically help debug this step.

I guess that the only choice is make/force (manually?) the AceView GTF to look exactly like a Ensembl GTF file until pizzly will support officially AceView GTF files.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303174548, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7C1iojHSFlxw7AXz2GTxnSDfbVHfHks5r8czKgaJpZM4NhtuE .

ndaniel commented 7 years ago

@kreil

Sorry, I do not have kallisto and pizzly right now installed on my computer but here are the command lines which should give you everything you need. These are the command lines for a successful run of pizzly on a very small RNA-seq data set with 17 known fusions genes:

wget ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_1.fq.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_2.fq.gz

kallisto index -k 31 -i transcriptome.idx  Homo_sapiens.GRCh38.cdna.all.fa.gz

kallisto quant -i transcriptome.idx  -o output_kallisto --fusion  reads_1.fq.gz reads_2.fq.gz

pizzly -k 31 --gtf Homo_sapiens.GRCh38.81.gtf.gz --cache cache.txt  --align-score 2 --insert-size 400 --fasta Homo_sapiens.GRCh38.cdna.all.fa.gz --output output_pizzly output_kallisto/fusion.txt

I would not use for fusion finding any gene annotation (that is GTF file) which is not updated regularly because I personally had Ensembl remove two genes which were actually cancer fusion genes (i.e. if a fusion gene is annotated as one gene in a GTF file then no fusion caller will ever find it; in order to be able to call a fusion gene any fusion caller needs two different genes).

AceView ... has more support from actually observed RNA-seq reads, including junction-specific reads

That might be a problem for fusion gene calling actually. If a RNA-seq dataset from a cancer sample gets in there then many of the fusion genes from that sample will be annotated as one regular gene (instead of two separated/different genes) and any fusion caller that will be using it later will never find those fusion genes again in any other new samples. This is why the quality of the gene annotation is so important.

kreil commented 7 years ago

Dear Daniel,

Thanks for the pointer and your words of warning.

Again I think I've been too sloppy in my original choice of words. Seeing its performance (both comprehensiveness and specificity) vis-a-vis EnsEMBL and Gencode in several large-scale datasets, we chose to work with AceView as a specialist resource with manual expert curation, and all models backed by complementary evidence sources (cDNA libraries, ESTs, ...)

That said, at the end of the day we probably want to rerun our analyses with different annotations and compare results.

For that, we need to find a generic way to rewrite GTF files to please the pizzly GTF parser.

Best regards, David

On 22 May 2017 at 20:32, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Sorry, I do not have kallisto and pizzly right now installed on my computer but here are the command lines which should give you everything you need. These are the command lines for a successful run of pizzly on a very small RNA-seq data set with 17 known fusions genes:

wget ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

wget ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_1.fq.gz

wget https://sourceforge.net/projects/fusioncatcher/files/test/reads_2.fq.gz

kallisto index -k 31 -i transcriptome.idx Homo_sapiens.GRCh38.cdna.all.fa.gz

kallisto quant -i transcriptome.idx -o output_kallisto --fusion reads_1.fq.gz reads_2.fq.gz

pizzly -k 31 --gtf Homo_sapiens.GRCh38.81.gtf.gz --cache cache.txt --align-score 2 --insert-size 400 --fasta Homo_sapiens.GRCh38.cdna.all.fa.gz --output output_pizzly output_kallisto/fusion.txt

I would not use for fusion finding any gene annotation (that is GTF file) which is not updated regularly because I personally had Ensembl remove two genes which were actually cancer fusion genes (i.e. if a fusion genes is annotated as one gene in a GTF file then no fusion caller will every find it).

AceView ... has more support from actually observed RNA-seq reads, including junction-specific reads

That might be a problem for fusion gene calling actually. If a RNA-seq dataset from a cancer sample gets in there then many of the fusion genes from that sample will be annotated as one gene (instead of two separated gene) and any fusion caller that will be using later will never find those fusion genes again in any other new samples. This is why the quality of the gene annotation is so important.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303183154, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7C_BfbbDOgJeaBPutTiyPexcXd_jrks5r8dTIgaJpZM4NhtuE .

ndaniel commented 7 years ago

@kreil

Seeing its performance (both comprehensiveness and specificity) vis-a-vis EnsEMBL and Gencode in several large-scale datasets, we chose to work with AceView as a specialist resource with manual expert curation, and all models backed by complementary evidence sources (cDNA libraries, ESTs, ...)

I think that is a confusion going on here. I meant gene annotation in all my posts only and only from point of view of fusion gene calling.
It might be very well be that AceView ouperforms other gene annotations on gene/transcript/exon expressions in RNA-seq dataset BUT that does not mean that it will outperform (or perform equally well) also on fusion gene calling also. Gene fusion calling is different than gene/transcript/exon expression (and different than SNP/indels calling). Regarding cDNA, ESTs there the challenge is the phenotype of the sample (is the sample from a healthy person or from a tumor patient?). All fusion gene callers assume that GTF file equals healthy person. Also the biological definition of gene (that is unit of herediy) is different than the gene definition of Ensembl/AceView/GenCode (that is based on cDNA and ESTs from healthy/not-so-healthy persons).

My point here is that the fusion callers work the best with the suggested genes annotations (for example Pizzly with Ensembl GTF) for the simple reason because the authors put most of their time in working/testing with that given specific gene annotation from the given database. Most of the fusion callers come with the suggested gene annotation for which they were tested the most. Most of the gene fusion callers are very sensitive to the gene annotation which they use (which is not the case when one does a gene/transcripts/exon expression analysis).

kreil commented 7 years ago

Ah, I see what you mean!

Yes, I understand that pizzly works with EnsEMBL because it was tested with that.

I would really like to see how it performs with alternative annotation GTFs, so I much hope we manage to rewrite generic GTFs to make the pizzly parseGTF happy!

Best wishes, David

On 22 May 2017 at 21:12, Daniel Nicorici notifications@github.com wrote:

@kreil https://github.com/kreil

Seeing its performance (both comprehensiveness and specificity) vis-a-vis EnsEMBL and Gencode in several large-scale datasets, we chose to work with AceView as a specialist resource with manual expert curation, and all models backed by complementary evidence sources (cDNA libraries, ESTs, ...)

I think that is a confusion going on here. I meant gene annotation in all my posts only and only from point of view of fusion gene calling. It might be very well be that AceView ouperforms other gene annotations on gene/transcript/exon expressions in RNA-seq dataset BUT that does not mean that it will outperform (or perform equally well) also on fusion gene calling also. Gene fusion calling is different than gene/transcript/exon expression (and different than SNP/indels calling). Regarding cDNA, ESTs there the challenge the phenotype of the sample (is the sample from a healthy person or from a tumor patient?). All fusion gene callers assume that GTF file equals healthy person. Also the biological definition of gene (that is unit of herediy) is different than the gene definition of Ensembl/AceView/GenCode (that is based on cDNA and ESTs from healthy/not-so-healthy persons).

My point here is that the fusion callers work the best with the suggested genes annotations (for example Pizzly with Ensembl GTF) for the simple reason because the authors put most of their time in working/testing with that given specific gene annotation from the given database. Most of the fusion callers come with the suggested gene annotation for which they were tested the most.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303193133, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7C6bLRUzpg1GbEd0rAGxPNiMKGkPoks5r8d4ngaJpZM4NhtuE .

kreil commented 7 years ago

Ok, just for info, the index cache shows gene and transcript lines for both 'our' and the 'standard' GTF files,

$ cut -f1 ensemble.index.cache.txt |sort |uniq -c 65988 GENE 216133 TRANSCRIPT $ cut -f1 our.index.cache.txt |sort |uniq -c 56047 GENE 243196 TRANSCRIPT

so it's not that they are not parsed but it must be something about the content.

kreil commented 7 years ago

Hmm, for those people who have trouble with the Gencode GTF files, do the Gencode GTF GENE lines just have an ID or an ID and a name?

Lacking gene names in our GTF gave the "0 transcripts" read artefact.

So one can either add a gene name or, I would suggest, adding a default in line 275 of GeneModel.hpp

  if (model.name == "") model.name = model.id;  // provide a default

With that fix, we don't get the "0 transcripts" message.

Still, we find no fusions. So there's more to fix I guess ...

Any ideas?

Many thanks.

kreil commented 7 years ago

Dear Pall,

We have since started playing with running pizzly using Ensembl annotation just to see how this goes, and were surprised to again have no fusion events remaining after the pizzly filter. We have run other tools to look for known fusion events, which confirmed ~20 known fusions in the data, which however we do not recover using pizzly.

Now, our data are from a mix of cancer cell lines, i.e., this is a heterogeneous sample. Would that violate assumptions by pizzly? I.e., is pizzly designed to also find fusion transcripts that are only present in a sub-population of the sample? As in a sample that contains normal and cancer cells, or a sample that contains heterogeneous cancer sub-clones?

I would be grateful for your thoughts and comments!

Best regards, David

On 22 May 2017 at 18:03, Dr. D. P. Kreil (Boku) David.Kreil@boku.ac.at wrote:

Dear Pall,

Thank you for your kind note! We have just generated the missing gene and transcript lines and removed all trailing white space in numeric fields, and the GTF file is now read in nicely. I would just suggest instead of the assert lines in parseGTF to terminate with an error showing the offending line in the GTF and clarifying that a gene or transcript line is missing for that.

With the fixed GTF file, however, we still find no fusions despite expecting some (cancer cell lines), so we're still exploring the situation! If you have any suggestions at all, I would of course be grateful. I note that the kallisto output fusion.txt is about 6 million lines, with both split and pair entries. Is it likely that despite all these there is not a single fusion?

We're currently running this command line

pizzly-g3gdb -k 31 --gtf human_AceView.2010_ERCC.gtf.gz --cache index.cache.txt --align-score 2 --insert-size 330 --fasta human.AceView.2010.selected.fasta.gz --output NVS_D_4/fusion.out NVS_D_4/fusion.txt

and you can see all files in http://ala.boku.ac.at/kreil/scratch/SEQC_kallisto_fusion/

I get

Opening cached file ... loaded 26265 genes and 0 transcripts Read a total of 55828 transcripts Number of kept records 0 out of 692687

Does that look alright? E.g., is it expected that only 0 transcripts are read from the cache? We've also tried with a larger insert-size of 9000, no different result though.

Kallisto was run as

kallisto quant --bias --fusion -i ../SEQC_kallisto/ref/transcripts.idx -o $OUT_PATH -t 16

Most grateful for any comments...

With best regards, David.

On 22 May 2017 at 17:41, Pall Melsted notifications@github.com wrote:

Hi @kreil https://github.com/kreil thanks for reporting this and the excellent notes. I'll have to put some more effort into the various GTF formats, for now we've been focusing on Ensembl, since this is our preferred option, but we want for it to work any annotation.

Is this the GTF file you are working with, ftp://ftp.ncbi.nih.gov/reposit ory/acedb/ncbi_37_Aug10.human.genes/AceView.ncbi_37.genes_gff.gff.gz ? If it is not is it possible for you to share the annotation so that I can make sure the code handles the format correctly.

Thanks, Pall

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pmelsted/pizzly/issues/8#issuecomment-303137897, or mute the thread https://github.com/notifications/unsubscribe-auth/AOe7CyVGeCQ-8ir07ngungTgWDNe6cgWks5r8ayOgaJpZM4NhtuE .

pmelsted commented 7 years ago

There are no assumptions about mixture of cell lines. Let's follow up on this offline.

pmelsted commented 7 years ago

The GTF issues have been fixed in the latest release. Thanks to @kreil for identifying the problem