itmat / rum

RNA-Seq Unified Mapper
http://cbil.upenn.edu/RUM
MIT License

Can't create new indices #159

Open tianyang-li opened 11 years ago

tianyang-li commented 11 years ago

I was trying to create new indices, but it seems that an error occurred.

I used both my own transcript models and those downloaded from the UCSC Genome Browser.

You don't seem to have Log::Log4perl installed. You may want to install it
via "cpan -i Log::Log4perl" so you can use advanced logging features.
Expected a number in the exon starts col at create_indexes_from_ucsc.pl line 97

==> log/rum_errors.log <==
Tue Dec 25 17:11:56 2012 12811 FATAL RUM::Death - Expected a number in the exon starts col at create_indexes_from_ucsc.pl line 97

==> log/rum.log <==
Tue Dec 25 17:07:47 2012 12811  INFO RUM::UI - START modify_fasta_header_for_genome_seq_database hs_genome.txt hs_genome.fa
Tue Dec 25 17:09:02 2012 12811  INFO RUM::UI - END modify_fasta_header_for_genome_seq_database hs_genome.txt hs_genome.fa
Tue Dec 25 17:09:02 2012 12811  INFO RUM::UI - START modify_fa_to_have_seq_on_one_line hs_genome.fa hs_genome_one-line-seqs_temp.fa
Tue Dec 25 17:10:40 2012 12811  INFO RUM::UI - END modify_fa_to_have_seq_on_one_line hs_genome.fa hs_genome_one-line-seqs_temp.fa
Tue Dec 25 17:10:40 2012 12811  INFO RUM::UI - START sort_genome_fa_by_chr hs_genome_one-line-seqs_temp.fa hs/hs_genome_one-line-seqs.fa
Tue Dec 25 17:11:56 2012 12811  INFO RUM::UI - END sort_genome_fa_by_chr hs_genome_one-line-seqs_temp.fa hs/hs_genome_one-line-seqs.fa
Tue Dec 25 17:11:56 2012 12811  INFO main - Removing temporary files hs_genome_one-line-seqs_temp.fa
Tue Dec 25 17:11:56 2012 12811  INFO RUM::UI - START make_master_file_of_genes gene_info_files gene_info_merged_unsorted.txt
Tue Dec 25 17:11:56 2012 12811 FATAL RUM::Death - Expected a number in the exon starts col at create_indexes_from_ucsc.pl line 97
mdelaurentis commented 11 years ago

Hi,

We have seen this error before, and it was due to a missing header line in one of the gene files. The gene file should have a header row that looks something like this (tab-delimited):

name chrom strand exonStarts exonEnds

I believe UCSC normally includes that header row in the file, but if you're using an annotation file from another source, I suppose it may not be there. Since we've now had two users see this error, maybe I'll go ahead and change the index creation scripts so that they work without it.

Can you please try adding that header row to the input file, and let me know if that works?

Thanks,

Mike


tianyang-li commented 11 years ago

I added

#name   chrom   strand  exonStarts  exonEnds

to the start of my file for transcript models, however it still gives this error

Expected a number in the exon starts col at create_indexes_from_ucsc.pl line 97

Here are the first 2 lines from my file

#name   chrom   strand  exonStarts  exonEnds
HG531_PATCH HG531_PATCH +   0,  34662,
delagoya commented 11 years ago

The script is sensitive to tabs versus spaces. Please make sure that saving the file after you added the header line did not turn the tabs into spaces.
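
One way to check for tabs that were silently converted to spaces is to count the tab-separated fields on every line. The following is a minimal Python sketch, not part of RUM; the function name `find_tab_problems` is hypothetical, and it assumes that the first line is the header and that every row should have the same number of fields as the header.

```python
def find_tab_problems(lines):
    """Return (line_number, field_count) pairs for suspicious lines.

    Assumption: the first line is the header, and every other line
    should split into the same number of tab-separated fields.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]
    if not rows:
        return []
    width = len(rows[0])
    problems = []
    if width < 2:
        # A header with a single field most likely contains spaces, not tabs.
        problems.append((1, width))
    for lineno, fields in enumerate(rows[1:], start=2):
        if len(fields) != width:
            problems.append((lineno, len(fields)))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_tabs.py GENE_FILE
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for lineno, count in find_tab_problems(handle):
                print(f"line {lineno}: {count} tab-separated field(s)")
```

An empty result only says that the columns are consistently tab-separated; it does not say anything about the values inside them.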

tianyang-li commented 11 years ago

I've made sure that the header line is tab separated, but it still gives that error. Here are the first 2 lines:

'#name\tchrom\tstrand\texonStarts\texonEnds\nHG531_PATCH\tHG531_PATCH\t+\t0,\t34662,\n'
mdelaurentis commented 11 years ago

Have you had any more luck with creating the index? That error basically means that it found a value in the exonStarts column that was not valid. Before the header row was there, it was complaining because it couldn't find the column. Now that the header row is there, it likely means that there is a row that has a badly formatted value. Every value in the exonStarts and exonEnds columns should be a comma-delimited list of integers. You should be able to check the file by doing something like:

cut -f4 INPUT | perl -ne 'next if $. == 1; /^(\d+)(,\d+)*/ or die "Bad row $_ on line $."'
cut -f5 INPUT | perl -ne 'next if $. == 1; /^(\d+)(,\d+)*/ or die "Bad row $_ on line $."'

I just changed the script so that it gives a slightly more useful error message. That will show up in the next release.
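
Note that the regex in the check above is anchored only at the start of the field, so it accepts any value that merely begins with a digit (for example `0,x`). A stricter check can be sketched in Python as below. This is not RUM's own parser: the function name `check_exon_columns` is hypothetical, the column positions (fourth and fifth, matching `cut -f4` and `cut -f5`) are taken from the five-column header discussed in this thread, and allowing a trailing comma is an assumption based on the sample row posted above.

```python
import re

# Comma-delimited integers; an optional trailing comma is allowed
# (assumption, based on values such as "0," in the sample row).
INT_LIST = re.compile(r"\d+(,\d+)*,?")


def check_exon_columns(lines, starts_col=3, ends_col=4):
    """Return (line_number, message) pairs for rows with bad exon columns.

    Column indices are zero-based; the first line is skipped as the header.
    """
    problems = []
    for lineno, line in enumerate(lines, start=1):
        if lineno == 1:
            continue  # header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(starts_col, ends_col):
            problems.append((lineno, "too few columns"))
            continue
        values = {"exonStarts": fields[starts_col], "exonEnds": fields[ends_col]}
        valid = True
        for label, value in values.items():
            # fullmatch requires the whole field to be a list of integers.
            if not INT_LIST.fullmatch(value):
                problems.append((lineno, f"bad {label}: {value!r}"))
                valid = False
        if valid:
            counts = [len([v for v in value.split(",") if v]) for value in values.values()]
            if counts[0] != counts[1]:
                problems.append((lineno, "exonStarts and exonEnds differ in length"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_exons.py GENE_FILE
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as handle:
            for lineno, message in check_exon_columns(handle):
                print(f"line {lineno}: {message}")
```

If this reports nothing, the exon columns are well formed under the assumptions above, and the cause of the error has to be looked for elsewhere.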


tianyang-li commented 11 years ago

I'm still getting this error although

cut -f4 INPUT | perl -ne 'next if $. == 1; /^(\d+)(,\d+)*/ or die "Bad row $_ on line $."'
cut -f5 INPUT | perl -ne 'next if $. == 1; /^(\d+)(,\d+)*/ or die "Bad row $_ on line $."'

showed my file is OK.

bioinfo89 commented 6 years ago

Hi,

I am facing a similar issue when creating indexes with the UCSC Perl script. The error is as follows:

ERROR: exon for chr19:24161509-24161590 not found. chr19 - 24161508 24163354 6 24161508,24161808,24162104,24162560,24162695,24163090, 24161590,24161985,24162480,24162611,24162790,24163354, ENSDART00000052461.6(danRer10_ensemblgene) i=0 at /home/kchandratre/tools/rum-master/bin/create_indexes_from_ucsc.pl line 97.

Any help will be appreciated.