hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License
51 stars 16 forks source link

Please specify a valid reference genome #8

Closed bellstwohearted closed 3 years ago

bellstwohearted commented 3 years ago

Thank you for developing such a powerful tool. I compiled it and the run_test.sh works well.

I want to test the hg38-1 set, so I downloaded the zip file:

-rw-r----- 1  500  500         40 Jul 16  2019 hg38-1.info
-rw-r----- 1  500  500 3132397894 Jul 16  2019 hg38-1.mut
-rw-r----- 1  500  500 2605285065 Jul 16  2019 hg38-1.vcf

While I run the following command:

gsalign -t 4 -i dataset/hg38-1 -q dataset/hg38-1.mut -o test/output

I got the following error:

Step1. Load the two genome sequences...
        Load the query sequences (25 chromosomes)
Please specify a valid reference genome

So, could you please let me know how to fix this?

hsinnan75 commented 3 years ago

-i should be followed by an index prefix. In you case, there should be files of hg38-1.bwt, hg38-1.amb, hg38-1.ann, ..... If you don't have those index files, you could use -r to specify the reference file, which is a fasta format file. Since I don't see those index files in your folder, you should use "-r $PATH/hg38-1.fasta" instead of using "-i hg38-1". Or you could use bwt_index to generate the index files for your reference genome.

bellstwohearted commented 3 years ago

-i should be followed by an index prefix. In you case, there should be files of hg38-1.bwt, hg38-1.amb, hg38-1.ann, ..... If you don't have those index files, you could use -r to specify the reference file, which is a fasta format file. Since I don't see those index files in your folder, you should use "-r $PATH/hg38-1.fasta" instead of using "-i hg38-1". Or you could use bwt_index to generate the index files for your reference genome.

Thank you for providing these suggestions. I know that a .fasta file can be called by -r, but I did not see such a file in the dataset. I am wondering whether I can generate those index files directly from the .vcf file using bwt_index?

bellstwohearted commented 3 years ago

I have tried

./bwt_index ../dataset/hg38-1.vcf ../dataset/hg38-1
[bwt_index] Pack FASTA... 5.08 sec
[bwt_index] Construct BWT for the packed sequence...
Floating point exception (core dumped)

So, how to fix this? Thank you~

hsinnan75 commented 3 years ago

bwt_index is used to index a genome file with FASTA format. VCF is not in FASTA format, it is for variant annotation. A fasta file looks like

seq1 ACGT.... seq2 GTAG...

You need to download the human genome sequence (either in .FASTA or .fna) and run bwt_index again.

Let me know if you still have trouble running bwt_index or GSAlign. Thank you!

bellstwohearted commented 3 years ago

bwt_index is used to index a genome file with FASTA format. VCF is not in FASTA format, it is for variant annotation. A fasta file looks like

seq1 ACGT.... seq2 GTAG...

You need to download the human genome sequence (either in .FASTA or .fna) and run bwt_index again.

Let me know if you still have trouble running bwt_index or GSAlign. Thank you!

Thank you! If I look at this link: http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/ Is the hg38.fa file the one you meant, as the fasta file? If so, does it contain the fasta file corresponding to hg38-1, hg38-3 and hg38-5?

hsinnan75 commented 3 years ago

I'm not sure what hg38-1, hg38-3 and hg38-5 are. Do you mean chromosome 1, 3, and 5? There are chr1, chr2, ..., chr22, chrX, chrY, and chrM in the file of hg38.fa. In fact, I used the same indexes as bwa does. If you already have built indexes using bwa, then you can just use those files to run GSAlign.

bellstwohearted commented 3 years ago

I'm not sure what hg38-1, hg38-3 and hg38-5 are. Do you mean chromosome 1, 3, and 5? There are chr1, chr2, ..., chr22, chrX, chrY, and chrM in the file of hg38.fa. In fact, I used the same indexes as bwa does. If you already have built indexes using bwa, then you can just use those files to run GSAlign.

I am looking at table 2 in your article and want to give a try to see what will happen. 图像_2020-11-20_093949

Thank you for the advices and I will give a try.

hsinnan75 commented 3 years ago

I forget those datasets. Now I understood what you mean by hg38-1 and the others. You can download those datasets (hg38-.tgz) at http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/. After you decompress those datasets, you will get files called hg38-.mut which were modified from hg38.fa. But you still need to download the original hg38 genome sequence as the reference to compare to the modified versions.

bellstwohearted commented 3 years ago

I forget those datasets. Now I understood what you mean by hg38-1 and the others. You can download those datasets (hg38-.tgz) at http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/. After you decompress those datasets, you will get files called hg38-.mut which were modified from hg38.fa. But you still need to download the original hg38 genome sequence as the reference to compare to the modified versions.

Well, I downloaded the hg38.fa file and run

bwa index hg38.fa

It takes about 6000 seconds to generate several files, then I run

gsalign -t 4 -i dataset/hg38-1 -q dataset/hg38-1.mut -o test/output

but it still gives an error, saying Please specify a valid reference genome.

Next I run

gsalign -t 4 -r dataset/hg38.fa -q dataset/hg38-1.mut -o test/output

It is still running, but is repeating the action what bwa index has already done, so I do not know what will happen later.

hsinnan75 commented 3 years ago

when you run GSAlign and use "-i" argument, you should give the prefix of index files after "-i". For example, if the prefix is hg38.fa (meaning there are hg38.fa.xxx files), then you can run GSAlign with "-i hg38.fa".

If you use "-i dataset/hg38-1", you should provide index files whose prefix is dataset/hg38-1.

bellstwohearted commented 3 years ago

when you run GSAlign and use "-i" argument, you should give the prefix of index files after "-i". For example, if the prefix is hg38.fa (meaning there are hg38.fa.xxx files), then you can run GSAlign with "-i hg38.fa".

If you use "-i dataset/hg38-1", you should provide index files whose prefix is dataset/hg38-1.

Sorry I misunderstood what you meant by prefix, now it is running in good order, and enters

Step2. Sequence analysis for all query chromosomes

Thank you!