bluesky333 commented 9 years ago

Dear Andre,

Although spladder works fine with marmoset data and example data, it does not work well with human data. Here is what I get when I run spladder on human data.

Augmenting splice graphs.

Generating splice graph ...

Total genes: 54121 Total genes with alternative isoforms: 24473 Total genes alternatively spliced: 23570 Total constitutively spliced: 30551 ...done.

Loading introns from file ... Traceback (most recent call last): File "./spladder/python/spladder.py", line 215, in spladder() File "./spladder/python/spladder.py", line 177, in spladder spladder_core(CFG) File "/home/och/PROJECTs/00_RNA_seq_practice/spladder/python/modules/core/spladdercore.py", line 26, in spladder_core genes = gen_graphs(genes, CFG) File "/home/och/PROJECTs/00_RNA_seq_practice/spladder/python/modules/core/gen_graphs.py", line 82, in gen_graphs introns = get_intron_list(genes, CFG) File "/home/och/PROJECTs/00_RNA_seq_practice/spladder/python/modules/reads.py", line 370, in get_intron_list chunks = sp.array([[CFG['chrm_lookup'][x.chr], strands.index(x.strand), x.start, x.stop] for x in genes], dtype = 'int') ValueError: '.' is not in list

Thanks, Yunsoo Kim

fabiolib commented 9 years ago

Dear Yunsoo, I'm not entirely sure what the cause of your problem is, hopefully Andre can help. In the meantime, I can assure you I have run SplAdder successfully on human RNAseq data using hg19 aligned reads and the GENCODE v19 gff. Regards, Fabio

akahles commented 9 years ago

Hi Yunsoo, This looks like your annotation or alignment file contains some values that SplAdder does not expect. Could you please send me an overview of both. If you are on a *nix system you can get the contigs in a GTF/GFF by:

cut -f 1 <your_file> | sort -u

and the contigs in the BAM file by:

samtools view -H <bam_file>

Would you mind posting the result or sending it to me?

bluesky333 commented 9 years ago

Hi Andre,

Thanks for the quick response.

The following is what I got for the contigs in the GTF file. chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr17_ctg5_hap1 chr17_gl000204_random chr17_gl000205_random chr18 chr19 chr19_gl000208_random chr19_gl000209_random chr1_gl000191_random chr1_gl000192_random chr2 chr20 chr21 chr22 chr3Hi chr4 chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random chr5 chr6 chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7 chr7_gl000195_random chr8 chr8_gl000196_random chr9 chr9_gl000199_random chr9_gl000201_random chrM chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000218 chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000225 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230 chrUn_gl000231 chrUn_gl000233 chrUn_gl000237 chrUn_gl000238 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242 chrUn_gl000243 chrUn_gl000247 chrX chrY

This is what I have for the contigs in the BAM file. @HD VN:1.3 SO:coordinate @PG ID:STAR PN:STAR VN:STAR_2.3.0e_r291 CL:STAR --runThreadN 4 --genomeDir /home/keunsoo/Bowtie2_index/hg19_STAR --readFilesIn ERR164601_1.fastq ERR164601_2.fastq --outFileNamePrefix pERR164601 --outSAMstrandField intronMotif cl:STAR --genomeDir /home/keunsoo/Bowtie2_index/hg19_STAR --readFilesIn ERR164601_1.fastq ERR164601_2.fastq --runThreadN 4 --outSAMstrandField intronMotif --outFileNamePrefix pERR164601 @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430Dear @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566

I think the problem is with STAR. When I used marmoset data, I used bowtie2 for the alignment. However, for the human data, I used STAR for faster process. @PG comes last in bowtie2 output, while @PG comes 2nd line for the STAR output.

The marmoset bam file gave me contigs as @HD VN:1.0 SO:coordinate @SQ SN:ENA|CM000856|CM000856.1 LN:210400635 @SQ SN:ENA|CM000857|CM000857.1 LN:204313951 @SQ SN:ENA|CM000858|CM000858.1 LN:190850796 @SQ SN:ENA|CM000859|CM000859.1 LN:171630274 @SQ SN:ENA|CM000860|CM000860.1 LN:159171411 @SQ SN:ENA|CM000861|CM000861.1 LN:158406734 @SQ SN:ENA|CM000862|CM000862.1 LN:155834243 @SQ SN:ENA|CM000863|CM000863.1 LN:128169293 @SQ SN:ENA|CM000864|CM000864.1 LN:124281992 @SQ SN:ENA|CM000865|CM000865.1 LN:132174527 @SQ SN:ENA|CM000866|CM000866.1 LN:130397257 @SQ SN:ENA|CM000867|CM000867.1 LN:121768101 @SQ SN:ENA|CM000868|CM000868.1 LN:117903854 @SQ SN:ENA|CM000869|CM000869.1 LN:108792865 @SQ SN:ENA|CM000870|CM000870.1 LN:98464013 @SQ SN:ENA|CM000871|CM000871.1 LN:96796970 @SQ SN:ENA|CM000872|CM000872.1 LN:74750902 @SQ SN:ENA|CM000873|CM000873.1 LN:47448759 @SQ SN:ENA|CM000874|CM000874.1 LN:49578535 @SQ SN:ENA|CM000875|CM000875.1 LN:44557958 @SQ SN:ENA|CM000876|CM000876.1 LN:50472720 @SQ SN:ENA|CM000877|CM000877.1 LN:49145316 @SQ SN:ENA|CM000878|CM000878.1 LN:142054208 @SQ SN:ENA|CM000879|CM000879.1 LN:2853901 @PG ID:TopHat VN:2.0.13 CL:/bin/tophat -o ./tophat_out904 -p 6 --b2-very-sensitive --allow-partial-mapping Callithrix_jacchus3.2 ./SRR1168904/trimmedSRR1168904.fastq

Thanks, Yunsoo

akahles commented 9 years ago

Hi Yunsoo, The different order of the records in the header should be no problem. I think what causes the error you observe is that there are contigs in your alignment file that are not present in the annotation or vice versa. This should usually not cause any problem, but seems not to be handled correctly in your case. I am currently traveling and will look into this as soon as I am back. If your analysis is urgent, you could filter your annotation to only contain contigs that are also present in your bam file. Cheers, Andre

warrenmcg commented 9 years ago

Hi Andre,

It's been a while since meeting in November. I hope you're well. I'm glad to see the conversion to the python has been completed. I want to say that I'm currently having a similar issue.

I'm running Spladder on Mouse RNA-Seq data, and using Ensembl v81 GTF annotations.

Here's what happens: +The log terminates after "Loading introns from file..." with the error "ERROR: c logic seems wrong". +The more detailed stdout output reveals an error after this line "41400 (46078) genes done (178208 introns taken) ... took 11 secs". +Doing the same diagnostic test that you suggested above to get the contigs from the GTF and the BAM file, I found that my gtf was missing some of the contigs listed in my bam file (I can send both lists if you want them), as well as vice versa. Examining some of the contigs that were missing in the GTF file, it appears that there aren't any annotated genes in those areas; just some ESTs (for example, JH584300.1).

I also attempted to filter out BAM alignments that were aligned to contigs not found in my GTF, and I got the same error at the same point in the process.

I realize you're probably still traveling, but I wanted to let you know that at least one other user has a similar issue.

EDIT: I filtered my BAM reads to exclude reads that aligned to contigs not found in my GTF, and I filtered my GTF to exclude annotations from contigs not found in my BAM file. That appears to have solved the issue. My analysis is still running right now, but it was able to get past the step it was stopped at before.

Best, Warren

akahles commented 9 years ago

Hi Warren,

Good to hear from you. I hope you are well. Just arrived back from traveling and working on my back log. Will look into your issue very shortly.

// Andre

akahles commented 9 years ago

I have made some changes that hopefully make the code more robust against discrepancies between BAM contigs and annotation contigs. Without being able to reproduce the error, though, it is difficult to debug. @warrenmcg or @bluesky333, if you could reduce the input data to a small testset and send it to me, this would be great. Then I can have another look. The first 10K lines of the BAM file and the full annotation should be sufficient. You can generate a small BAM file using samtools:

samtools view -H <big_file.bam> | head -n 10000 | samtools view -bS -o <small_file.bam> -

Thanks and Cheers, Andre

warrenmcg commented 9 years ago

Hi Andre,

Two things: 1) The -H (capital H) option only prints out the header; it's the -h option we need here. Here is the revised line of code:

samtools view -h <big_file.bam> | head -n 10000 | samtools view -bS -o <small_file.bam> -

2) For me, the difference between my BAM file, and the BAM file filtered to exclude reads aligned to contigs not found in my GTF file, was incredibly tiny (~400 alignments out of 111 million), so a sample from the first 10K alignments did not capture a single alignment aligned to an absent contig. I did a set of commands (see below) to filter out the questionable alignments. I then tested which combinations cause the problem ("bad" GTF + "bad" BAM; bad GTF + "good" BAM; or "good" GTF + bad BAM). It turns out that the error is specifically from the GTF having contigs not found in the BAM file. Andre, I'll email you the files I used so you can reproduce the problem.

The commands I did were the following: step 1: get your list of contigs

cut -f 1 <your_anno_file> | sort -u > list_of_contigs.txt

This redirects Andre's suggested command into a file named "list_of_contigs.txt". Then compare this list to the contig list in your BAM file, and create a new file called "list_of_BadBAMcontigs.txt".

step 2: isolate questionable alignments

samtools view -bh <big_file.bam> `cat </path/to/list_of_BadBAMcontigs.txt> | tr "\n" " "` > filtered_hits.bam
samtools index filtered_hits.bam

In this step, you are taking those contigs that aren't found in the GTF file, and putting them in the format "chr1 chr10 chr11 chr12 ..." etc., with spaces between each name, and feeding that as an argument to samtools. Samtools then writes only alignments that align to those contigs to a new file that I've called "filtered_hits.bam". This approach was taken from a forum post.

Finally, I generated a "good GTF" file from the Ensembl Mouse v81 GTF file provided on their website using grep for "CHR" which were the only contigs not in the BAM file.

akahles commented 9 years ago

Thanks @warrenmcg for the detailed reply. Sorry for the -H typo ... I will look at the example files and will post here once I figured out what the problem is.

akahles commented 9 years ago

I have adapted the code to be more robust against differences between annotation and alignment contigs. I was able to reproduce both errors and the changes fixed the issue in my test cases. Maybe you can give it a try and see if it works for you. I will close this issue for now, but please re-open if the problem persists. Thanks to @warrenmcg and @bluesky333 for contributing!

ratschlab / spladder

Loading introns error #11

Augmenting splice graphs.