twolinin / longphase

GNU General Public License v3.0
98 stars 6 forks source link

Failed to open FASTA file with v1.7 #74

Closed quantrannnn closed 1 week ago

quantrannnn commented 1 month ago

Hi, I ran Longphase v1.7 but got the error: [E::fai_load3_core] Failed to open FASTA file. It worked well with v1.6. Could you fix this in the new release? Thank you.

ythuang0522 commented 1 month ago

Hi @quantrannnn This error message indicates the loss of fasta index (fai). It's odd as we have upgraded from v1.6 to v1.7, v1.7.1, and 1.7.2 for quite some time without having this issue and we didn't make any change regarding the loading of FASTA. Can you first check if the fai index is broken (e.g., by regenerating with SAMtools) and if the path is correct? If everything looks fine, we need more specific commands (arguments and version number) and the human genome version you are using. Thanks.

quantrannnn commented 1 month ago

Hi @ythuang0522, the FASTA index file and the path are fine. I used version 1.7.2 first but got that error, so I removed it and tried again with version 1.6, and it worked. The input files and commands are the same for both versions; ./longphase_linux-x64 phase \ -b sample.bam \ -s sample_SN.vcf \ --sv-file sample_SV.vcf \ -r ARS_UCD_v2.0.fa \ #cattle's reference genome -o sample_out \ -t 8 \ --ont

ythuang0522 commented 1 month ago

Hi @quantrannnn, Thank you and the command looks fine. We will download the cattle genome and see if it can be successfully loaded in our env and compare with your fai. Btw, have you ever successfully loaded other genome such as GRCh38?

quantrannnn commented 1 month ago

Yeah, I just checked it. It seems like version 1.7.2 works with GRCh38 but not with my reference genome. Version 1.6 works with both.

ythuang0522 commented 1 month ago

We downloaded the cattle ref genome and successfully loaded the fasta without errors. However, we don't have the bam and vcf for downstream tasks. Can you provide a partial bam and vcf for us debugging? e.g., samtools view input.bam "Chr10:18000-45500" > output.bam

quantrannnn commented 1 month ago

Hi,

You can use these small test files for debugging. If you use the ARS_UCD_v2.0 reference genome from NCBI, please convert the chromosome names to 1, 2, 3,..., 29, X, Y. It seems to me that the error is not in loading the fasta file but in loading the header of VCF files. Could the error occur because there are too many contigs in my reference genome?

ythuang0522 commented 1 month ago

Hi we didn’t see the test files. Can you send to my email ythuang at ccu.edu.tw or attached in the response?

leonschuetz commented 1 month ago

Hi, I ran in the same/similar error, but we use a standard GRCh38 genome:

LongPhase Ver 1.7.2

--- File Parameter ---
SNP File      : /.../23014LRa023L2_01_var.vcf.gz_unphased.vcf.gz
SV  File      : /.../23014LRa023L2_01_var_structural_variants.vcf.gz_unphased.vcf.gz
MOD File      : /.../23014LRa023L2_01_var_modcall.vcf
REF File      : /mnt/storage2/megSAP/data/genomes/GRCh38.fa
Output Prefix : /.../longphase1.7.2
Generate Dot  : False
BAM File      : /.../23014LRa023L2_01.bam

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : True
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 7s
parsing SV VCF ... 0s
parsing Meth VCF ... 14s
reading reference ... [E::fai_load3_core] Failed to open FASTA file /mnt/storage2/megSAP/data/genomes/GRCh38.fa
make: *** [Makefile:90: test_longphase_1.7.2] Segmentation fault (core dumped)

with the following command:

/mnt/storage2/megSAP/tools/longphase_v1.7.2/longphase_linux-x64 phase --ont --indels -t 5 \
-s /.../23014LRa023L2_01_var.vcf.gz_unphased.vcf.gz \
-b /.../23014LRa023L2_01.bam \
-r /mnt/storage2/megSAP/data/genomes/GRCh38.fa \
-o /.../longphase1.7.2 \
--sv-file /.../23014LRa023L2_01_var_structural_variants.vcf.gz_unphased.vcf.gz \
--mod-file /.../23014LRa023L2_01_var_modcall.vcf

(I replaced the actual path with /.../ to make it more readable)

It works fine with v1.6, so I think the Input is fine and it's a problem with 1.7(.2)

ythuang0522 commented 1 month ago

Hi @leonschuetz, did you re-download the binary version v1.7.2 in Github which we updated two weeks ago? https://github.com/twolinin/longphase/releases/tag/v1.7.2 We fix a compiling issue in previous release of v1.7.2. The fasta-loading issue has been resolved after @quantrannnn updated to the latest version (we communicate through emails). Please let us know if the updated version also resolve yours.

leonschuetz commented 1 month ago

Hi @ythuang0522, Thanks for the quick response I tried the current binary for 1.7.2 and I build the current main branch and still got the same error. We are useing Ubuntu 20.04.6 LTS. Did somthing change with the requirements?

I also ran I on 2 Servers now, (one fairly old Intel server and one new AMD server) always the same error.

ythuang0522 commented 1 month ago

@leonschuetz Did you mean by recompiling from the scratch? I expect you directly run the binary file without re-compilation. If you insist to re-compile, you have to be sure you have type 'make clean' to clean all previous object files. If you mean run the binary file without compilation and still got the same errors, may I know your GRCh38 contains ALT or not? We are using GRCh38 without ALT contigs and are wondering if this may be the cause.

leonschuetz commented 1 month ago

I ran the binary which is provided on the release page and I also tested the version I compiled myself (which I haven't done before, so there should be no previous versions). Both programs raise the same exception.

We use a standard GRCh38 with masked duplications and without ALT contigs but with decoys. ( You can download it here)

twolinin commented 1 month ago

@leonschuetz We have checked the reference and found no issues. Could you please provide the header of the problematic VCF or the VCF file you used? Errors can occur if the reference and the contig names recorded in the VCF are not consistent.

image

leonschuetz commented 1 month ago

Hi,

here are the headers. All files use ##contig=<ID=chrMT,length=16569> for the mitochondrial chromosome. headers.zip

twolinin commented 1 month ago

@leonschuetz We tested using the header and reference you provided in a new Ubuntu environment. Unfortunately, I was unable to replicate the error message.

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
$ ./longphase_linux-x64 phase --ont -t 96 -b tmp.bam -s snv_header.vcf --sv-file sv_header.vcf --mod-file mod_header.vcf -o test -r GRCh38.fa --indels
LongPhase Ver 1.7.2

--- File Parameter ---
SNP File      : snv_header.vcf
SV  File      : sv_header.vcf
MOD File      : mod_header.vcf
REF File      : GRCh38.fa
Output Prefix : test
Generate Dot  : False
BAM File      : tmp.bam

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : True
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 0s
parsing SV VCF ... 0s
parsing Meth VCF ... 0s
reading reference ... 1s

parsing total:  0s
merge results ... 0s
writeResult SNP ... 0s
write SV Result ... 0s
write mod Result ... 0s

total process: 1s

Could you please download our data to test if longphase runs correctly? https://drive.google.com/drive/folders/1FNg_GCF7o8ywYMipw8kbrbFfAqDCX00a

$ ./longphase_linux-x64 phase -b hg002.sup.10x.cram --ont -t 24 -r GCA_000001405.15_GRCh38_no_alt_analysis_set.fa -s pepper_10x.vcf.gz --sv-file sniffles_v2.2_10x.vcf.gz --mod-file modcall_10x.vcf.gz --indels -o test
LongPhase Ver 1.7.2

--- File Parameter ---
SNP File      : pepper_10x.vcf.gz
SV  File      : sniffles_v2.2_10x.vcf.gz
MOD File      : modcall_10x.vcf.gz
REF File      : GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
Output Prefix : test
Generate Dot  : False
BAM File      : hg002.sup.10x.cram

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : True
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 6s
parsing SV VCF ... 0s
parsing Meth VCF ... 2s
reading reference ... 1s
(chrY,6s)(chr22,9s)(chr18,10s)(chrX,11s)(chr15,14s)(chr19,15s)(chr14,15s)(chr17,17s)(chr9,17s)(chr21,17s)(chr16,18s)(chr8,18s)(chr13,19s)(chr20,19s)(chr10,21s)(chr12,21s)(chr11,22s)(chr6,22s)(chr5,23s)(chr7,23s)(chr3,25s)(chr4,28s)(chr2,29s)(chr1,34s)
parsing total:  35s
merge results ... 1s
writeResult SNP ... 16s
write SV Result ... 1s
write mod Result ... 2s

total process: 64s
leonschuetz commented 1 month ago

Hi,

I ran your test files and they work:

mkdir -p /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/out
/mnt/storage2/megSAP/tools/longphase_v1.7.2/longphase_linux-x64 phase --ont --indels -t 5 \
-s /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/pepper_10x.vcf.gz \
-b /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/hg002.sup.10x.cram \
-r /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
-o /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/out \
--sv-file /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/sniffles_v2.2_10x.vcf.gz \
--mod-file /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/modcall_10x.vcf.gz
LongPhase Ver 1.7.2

--- File Parameter ---
SNP File      : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/pepper_10x.vcf.gz
SV  File      : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/sniffles_v2.2_10x.vcf.gz
MOD File      : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/modcall_10x.vcf.gz
REF File      : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
Output Prefix : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/out
Generate Dot  : False
BAM File      : /mnt/storage2/users/ahschul1/backup/projects/2024_04_lr_cram/+data/longphase_test_files/hg002.sup.10x.cram

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : True
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 6s
parsing SV VCF ... 1s
parsing Meth VCF ... 3s
reading reference ... 26s
(chr5,27s)(chr3,27s)(chr2,33s)(chr4,33s)(chr1,39s)(chr9,20s)(chr6,26s)(chr7,27s)(chr8,22s)(chr10,19s)(chr14,13s)(chr15,13s)(chr13,18s)(chr12,20s)(chr11,22s)(chr16,15s)(chr19,11s)(chr18,13s)(chr17,16s)(chr20,13s)(chrY,6s)(chrX,9s)(chr22,13s)(chr21,14s)
parsing total:  97s
merge results ... 2s
writeResult SNP ... 14s
write SV Result ... 2s
write mod Result ... 3s

total process: 154s

So I think the problem are the input files. I will try to generate a small test case and share it with you.

leonschuetz commented 1 month ago

I managed to get a small test case which will lead to the error (at least on our machine):

testcase

This will lead to the error when I use the genome I liked before (GRCh38)

reading reference ... [E::fai_load3_core] Failed to open FASTA file /mnt/storage2/megSAP/data/genomes/GRCh38.fa
twolinin commented 4 weeks ago

@leonschuetz Thank you for providing the file, which helped us quickly identify the cause of the error. The issue was that we did not account for the possibility of the reference containing too many chromosomes/contigs, which caused insufficient buffer space when reading the fasta index. We have now changed it to read the fasta index only once. You can download and test the latest version 1.7.3.

v1.7.2

./longphase_linux-x64 phase -r GRCh38.fa -b hg001_1.bam -s hg001_snv.vcf.gz --ont
LongPhase Ver 1.7.2

--- File Parameter ---
SNP File      : hg001_snv.vcf.gz
SV  File      :
MOD File      :
REF File      : GRCh38.fa
Output Prefix : result
Generate Dot  : False
BAM File      : hg001_1.bam

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : False
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 0s
parsing SV VCF ... 0s
parsing Meth VCF ... 0s
reading reference ... [E::fai_load3_core] Failed to open FASTA file GRCh38.fa
Segmentation fault (core dumped)

v1.7.3

$ ./longphase_linux-x64 phase -r GRCh38.fa -b hg001_1.bam -s hg001_snv.vcf.gz --ont
LongPhase Ver 1.7.3

--- File Parameter ---
SNP File      : hg001_snv.vcf.gz
SV  File      :
MOD File      :
REF File      : GRCh38.fa
Output Prefix : result
Generate Dot  : False
BAM File      : hg001_1.bam

--- Phasing Parameter ---
Seq Platform       : ONT
Phase Indel        : False
Distance Threshold : 300000
Connect Adjacent   : 20
Edge Threshold     : 0.7
Overlap Threshold  : 0.2
Mapping Quality    : 1
Variant Confidence : 0.75
ReadTag Confidence : 0.65

parsing VCF ... 0s
parsing SV VCF ... 0s
parsing Meth VCF ... 0s
reading reference ... 0s
(chr1,0s)
parsing total:  0s
merge results ... 0s
writeResult SNP ... 0s

total process: 0s
leonschuetz commented 3 weeks ago

With 1.7.3 it works!

Thank You for the fix and the CRAm support.