Closed bellstwohearted closed 3 years ago
-i should be followed by an index prefix. In you case, there should be files of hg38-1.bwt, hg38-1.amb, hg38-1.ann, ..... If you don't have those index files, you could use -r to specify the reference file, which is a fasta format file. Since I don't see those index files in your folder, you should use "-r $PATH/hg38-1.fasta" instead of using "-i hg38-1". Or you could use bwt_index to generate the index files for your reference genome.
-i should be followed by an index prefix. In you case, there should be files of hg38-1.bwt, hg38-1.amb, hg38-1.ann, ..... If you don't have those index files, you could use -r to specify the reference file, which is a fasta format file. Since I don't see those index files in your folder, you should use "-r $PATH/hg38-1.fasta" instead of using "-i hg38-1". Or you could use bwt_index to generate the index files for your reference genome.
Thank you for providing these suggestions. I know that a .fasta
file can be called by -r
, but I did not see such a file in the dataset. I am wondering whether I can generate those index files directly from the .vcf
file using bwt_index
?
I have tried
./bwt_index ../dataset/hg38-1.vcf ../dataset/hg38-1
[bwt_index] Pack FASTA... 5.08 sec
[bwt_index] Construct BWT for the packed sequence...
Floating point exception (core dumped)
So, how to fix this? Thank you~
bwt_index is used to index a genome file with FASTA format. VCF is not in FASTA format, it is for variant annotation. A fasta file looks like
seq1 ACGT.... seq2 GTAG...
You need to download the human genome sequence (either in .FASTA or .fna) and run bwt_index again.
Let me know if you still have trouble running bwt_index or GSAlign. Thank you!
bwt_index is used to index a genome file with FASTA format. VCF is not in FASTA format, it is for variant annotation. A fasta file looks like
seq1 ACGT.... seq2 GTAG...
You need to download the human genome sequence (either in .FASTA or .fna) and run bwt_index again.
Let me know if you still have trouble running bwt_index or GSAlign. Thank you!
Thank you! If I look at this link: http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/
Is the hg38.fa
file the one you meant, as the fasta file? If so, does it contain the fasta file corresponding to hg38-1, hg38-3 and hg38-5?
I'm not sure what hg38-1, hg38-3 and hg38-5 are. Do you mean chromosome 1, 3, and 5? There are chr1, chr2, ..., chr22, chrX, chrY, and chrM in the file of hg38.fa. In fact, I used the same indexes as bwa does. If you already have built indexes using bwa, then you can just use those files to run GSAlign.
I'm not sure what hg38-1, hg38-3 and hg38-5 are. Do you mean chromosome 1, 3, and 5? There are chr1, chr2, ..., chr22, chrX, chrY, and chrM in the file of hg38.fa. In fact, I used the same indexes as bwa does. If you already have built indexes using bwa, then you can just use those files to run GSAlign.
I am looking at table 2 in your article and want to give a try to see what will happen.
Thank you for the advices and I will give a try.
I forget those datasets. Now I understood what you mean by hg38-1 and the others. You can download those datasets (hg38-.tgz) at http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/. After you decompress those datasets, you will get files called hg38-.mut which were modified from hg38.fa. But you still need to download the original hg38 genome sequence as the reference to compare to the modified versions.
I forget those datasets. Now I understood what you mean by hg38-1 and the others. You can download those datasets (hg38-.tgz) at http://bioapp.iis.sinica.edu.tw/~arith/GSAlign/Datasets/. After you decompress those datasets, you will get files called hg38-.mut which were modified from hg38.fa. But you still need to download the original hg38 genome sequence as the reference to compare to the modified versions.
Well, I downloaded the hg38.fa
file and run
bwa index hg38.fa
It takes about 6000 seconds to generate several files, then I run
gsalign -t 4 -i dataset/hg38-1 -q dataset/hg38-1.mut -o test/output
but it still gives an error, saying Please specify a valid reference genome.
Next I run
gsalign -t 4 -r dataset/hg38.fa -q dataset/hg38-1.mut -o test/output
It is still running, but is repeating the action what bwa index
has already done, so I do not know what will happen later.
when you run GSAlign and use "-i" argument, you should give the prefix of index files after "-i". For example, if the prefix is hg38.fa (meaning there are hg38.fa.xxx files), then you can run GSAlign with "-i hg38.fa".
If you use "-i dataset/hg38-1", you should provide index files whose prefix is dataset/hg38-1.
when you run GSAlign and use "-i" argument, you should give the prefix of index files after "-i". For example, if the prefix is hg38.fa (meaning there are hg38.fa.xxx files), then you can run GSAlign with "-i hg38.fa".
If you use "-i dataset/hg38-1", you should provide index files whose prefix is dataset/hg38-1.
Sorry I misunderstood what you meant by prefix
, now it is running in good order, and enters
Step2. Sequence analysis for all query chromosomes
Thank you!
Thank you for developing such a powerful tool. I compiled it and the
run_test.sh
works well.I want to test the hg38-1 set, so I downloaded the zip file:
While I run the following command:
I got the following error:
So, could you please let me know how to fix this?