Open dbrg77 opened 3 years ago
It seems that your reference file is broken. Can you share the reference genome "genome_chrom.fa" you were using?
Btw, it is supposed to be a better practice if you use all the sequences in the GRCh38 reference genome to map reads and filter on the results by the chromosome names later.
Thanks for the quick reply and tips. The fasta file is good with hisat2
and bowtie2
, so I thought it was okay. In addition, the _alt
chromosomes in GRCh38
sometimes cause problems during mapping, so I removed those chromosomes and other scaffolds. I will try what you suggested in future.
Anyway, the fasta file can be downloaded here: https://xc1-temp.oss-cn-shenzhen.aliyuncs.com/genome_chrom.fa.gz
md5: 7ffe9e9ea56e1671a6ef8c80d77c5a75
I hope the download speed is acceptable.
Thanks for sharing the link. I was able to build the index without any error (see below). Can you run again on the gz file you shared? I still guess there might be problem your fasta file. Also can you tell us the on which system your are running the tool and the compiler you used? Thanks.
./chromap -i -r /tmp/genome_chrom.fa.gz -o /tmp/genome_chrom.index Build index for the reference. Kmer length: 17, window size: 7 Reference file: /tmp/genome_chrom.fa.gz Output file: /tmp/genome_chrom.index Loaded all sequences successfully in 16.59s, number of sequences: 25, number of bases: 3088286401. Collected 737136231 minimizers. Sorted minimizers. Kmer size: 17, window size: 7. Lookup table size: 392946789, # buckets: 536870912, occurrence table size: 445474231, # singletons: 291662000. Built index successfully in 202.59s. [M::Statistics] kmer size: 17; skip: 7; #seq: 25 [M::Statistics::2.716] distinct minimizers: 392946789 (74.22% are singletons); average occurrences: 1.876; average spacing: 4.190 Saved in 36.04s.
Hmm ... I tried with the gz
file, it was still not working. The error happens on our cluster, which uses Red Hat Enterprise Server 7.5 (Maipo)
with gcc v8.2.0
. The default system-wide version is actually gcc v4.8.5
, which failed to compile. I'm using v8.2.0
by loading a module.
I have also tried our local workstation, which runs Ubuntu 18.04
with gcc v7.5.0
, and the genome index was built successfully.
Not sure what's going on ...
How much memory do you have on the cluster and the workstation?
fyi, I compiled the code with gcc 8.2.0 and then run it with the reference genome you provided. I was able to build the index and didn't get any error. So I have no idea why this happened. Maybe on your server, you can try another GCC version (>7) and see if that works.
Thanks for the reply. On the cluster, I submit job with bsub
and request 48G memory, which should be enough. I just found out that I can successfully build index with individual chrs and a combination of just a few chrs (I tried chr1-2
, chr1-3
). It starts to fail when I concatenate 4 chrs (I tried chr1-4
, chr2-5
). It provides a file named like core.858022
with large size.
I don't know what's going on either. I will use our workstation for the time being, and trying to figure out the problem in the meantime.
Thanks for the help so far :-)
It would be helpful for me to understand the problem if you can add one line of the code and then run make clean
and then make
. After that you can run Chromap to build the index again and let me know the output.
After this line https://github.com/haowenz/chromap/blob/6e97125b9cc4f413690684f109c1d5bd9d8c2e79/src/index.cc#L18, and before the line https://github.com/haowenz/chromap/blob/6e97125b9cc4f413690684f109c1d5bd9d8c2e79/src/index.cc#L19 you got the error, just add
std::cerr << "len: " << len << " num bases: " << reference.GetNumBases() << "\n";
Then it will print these two values before the assertion failed. And from these two values, I might be able to see the problem. You can do this with 4 chromosomes, and then all the chromosomes. Thanks!
Your scheduler generated that large file "core.xxx" when the program failed. It is not generated by Chromap. So it is not useful for now.
Okay, done.
Here are the two numbers for chr1-4:
len: 2097152 num bases: 879660065
Here are the two numbers for all chromosomes:
len: 69810327 num bases: 3088286401
This is so weird. We have no idea why this happened. And we could not reproduce the error which made it hard for us to debug. In rare case, it can also be other issues rather than Chromap bugs. So we will keep this open and see if others get into the same trouble. Thanks again for providing the data and testing the tool!
Hi, just realised that I forgot to report back. Sorry!!!
I think the problem is not in the reference file. There might be something wrong with my environment on the cluster. Basically, if I download the source and compile on the cluster, it successfully compiles but fails to run, i.e. showing the errors mention in the previous posts. However, if I download the pre-built binary, everything works fine.
It is so strange. Anyway, thanks very much for the help !!
Hi, thanks for the tool
chromap
, which looks really promising based on the preprint.I'm having trouble creating genome index for human. I don't have any problem with the test data, and I successfully created the index of
test/ref.fa
. However, when I run the following command for human genome:~/tools/chromap/chromap -i -r ~/reference/homo_sapiens/ucsc/hg38/genome_fa/genome_chrom.fa -o hs_chrom_index
where
genome_chrom.fa
contains fasta records of human chr1-22, chrX, chrY and chrM. I got the following output and error:I tired the pre-compile binary and cloning the github repo and compiling by myself. The same thing happened.
Could you help? Thank you!
Regards, Xi