Failed to create genome index

dbrg77 commented 3 years ago

Hi, thanks for the tool chromap, which looks really promising based on the preprint.

I'm having trouble creating genome index for human. I don't have any problem with the test data, and I successfully created the index of test/ref.fa. However, when I run the following command for human genome:

~/tools/chromap/chromap -i -r ~/reference/homo_sapiens/ucsc/hg38/genome_fa/genome_chrom.fa -o hs_chrom_index

where genome_chrom.fa contains fasta records of human chr1-22, chrX, chrY and chrM. I got the following output and error:

Build index for the reference.
Kmer length: 17, window size: 7
Reference file: /work/bio-chenx/reference/homo_sapiens/ucsc/hg38/genome_fa/genome_chrom.fa
Output file: hs_chrom_index
Loaded all sequences successfully in 4.27s, number of sequences: 25, number of bases: 3088286401.
Collected 737136231 minimizers.
Sorted minimizers.
Kmer size: 17, window size: 7.
Lookup table size: 392946789, # buckets: 536870912, occurrence table size: 445474231, # singletons: 291662000.
Built index successfully in 197.98s.
[M::Statistics] kmer size: 17; skip: 7; #seq: 25
chromap: src/index.cc:19: void chromap::Index::Statistics(uint32_t, const chromap::SequenceBatch&): Assertion `len == reference.GetNumBases()' failed.
Aborted (core dumped)

I tired the pre-compile binary and cloning the github repo and compiling by myself. The same thing happened.

Could you help? Thank you!

Regards, Xi

haowenz commented 3 years ago

It seems that your reference file is broken. Can you share the reference genome "genome_chrom.fa" you were using?

Btw, it is supposed to be a better practice if you use all the sequences in the GRCh38 reference genome to map reads and filter on the results by the chromosome names later.

dbrg77 commented 3 years ago

Thanks for the quick reply and tips. The fasta file is good with hisat2 and bowtie2, so I thought it was okay. In addition, the _alt chromosomes in GRCh38 sometimes cause problems during mapping, so I removed those chromosomes and other scaffolds. I will try what you suggested in future.

Anyway, the fasta file can be downloaded here: https://xc1-temp.oss-cn-shenzhen.aliyuncs.com/genome_chrom.fa.gz

md5: 7ffe9e9ea56e1671a6ef8c80d77c5a75

I hope the download speed is acceptable.

haowenz commented 3 years ago

Thanks for sharing the link. I was able to build the index without any error (see below). Can you run again on the gz file you shared? I still guess there might be problem your fasta file. Also can you tell us the on which system your are running the tool and the compiler you used? Thanks.

./chromap -i -r /tmp/genome_chrom.fa.gz -o /tmp/genome_chrom.index Build index for the reference. Kmer length: 17, window size: 7 Reference file: /tmp/genome_chrom.fa.gz Output file: /tmp/genome_chrom.index Loaded all sequences successfully in 16.59s, number of sequences: 25, number of bases: 3088286401. Collected 737136231 minimizers. Sorted minimizers. Kmer size: 17, window size: 7. Lookup table size: 392946789, # buckets: 536870912, occurrence table size: 445474231, # singletons: 291662000. Built index successfully in 202.59s. [M::Statistics] kmer size: 17; skip: 7; #seq: 25 [M::Statistics::2.716] distinct minimizers: 392946789 (74.22% are singletons); average occurrences: 1.876; average spacing: 4.190 Saved in 36.04s.

dbrg77 commented 3 years ago

Hmm ... I tried with the gz file, it was still not working. The error happens on our cluster, which uses Red Hat Enterprise Server 7.5 (Maipo) with gcc v8.2.0. The default system-wide version is actually gcc v4.8.5, which failed to compile. I'm using v8.2.0 by loading a module.

I have also tried our local workstation, which runs Ubuntu 18.04 with gcc v7.5.0, and the genome index was built successfully.

Not sure what's going on ...

haowenz commented 3 years ago

How much memory do you have on the cluster and the workstation?

haowenz commented 3 years ago

fyi, I compiled the code with gcc 8.2.0 and then run it with the reference genome you provided. I was able to build the index and didn't get any error. So I have no idea why this happened. Maybe on your server, you can try another GCC version (>7) and see if that works.

dbrg77 commented 3 years ago

Thanks for the reply. On the cluster, I submit job with bsub and request 48G memory, which should be enough. I just found out that I can successfully build index with individual chrs and a combination of just a few chrs (I tried chr1-2, chr1-3). It starts to fail when I concatenate 4 chrs (I tried chr1-4, chr2-5). It provides a file named like core.858022 with large size.

I don't know what's going on either. I will use our workstation for the time being, and trying to figure out the problem in the meantime.

Thanks for the help so far :-)

haowenz commented 3 years ago

It would be helpful for me to understand the problem if you can add one line of the code and then run make clean and then make. After that you can run Chromap to build the index again and let me know the output.

After this line https://github.com/haowenz/chromap/blob/6e97125b9cc4f413690684f109c1d5bd9d8c2e79/src/index.cc#L18, and before the line https://github.com/haowenz/chromap/blob/6e97125b9cc4f413690684f109c1d5bd9d8c2e79/src/index.cc#L19 you got the error, just add

std::cerr << "len: " << len << " num bases: " << reference.GetNumBases() << "\n";

Then it will print these two values before the assertion failed. And from these two values, I might be able to see the problem. You can do this with 4 chromosomes, and then all the chromosomes. Thanks!

Your scheduler generated that large file "core.xxx" when the program failed. It is not generated by Chromap. So it is not useful for now.

dbrg77 commented 3 years ago

Okay, done.

Here are the two numbers for chr1-4:

len: 2097152 num bases: 879660065

Here are the two numbers for all chromosomes:

len: 69810327 num bases: 3088286401

haowenz commented 3 years ago

This is so weird. We have no idea why this happened. And we could not reproduce the error which made it hard for us to debug. In rare case, it can also be other issues rather than Chromap bugs. So we will keep this open and see if others get into the same trouble. Thanks again for providing the data and testing the tool!

dbrg77 commented 2 years ago

Hi, just realised that I forgot to report back. Sorry!!!

I think the problem is not in the reference file. There might be something wrong with my environment on the cluster. Basically, if I download the source and compile on the cluster, it successfully compiles but fails to run, i.e. showing the errors mention in the previous posts. However, if I download the pre-built binary, everything works fine.

It is so strange. Anyway, thanks very much for the help !!

haowenz / chromap

Failed to create genome index #18