miwipe / ngsLCA

GNU General Public License v3.0

Segmentation fault (core dumped) #22

Closed maxibor closed 2 years ago

maxibor commented 2 years ago

As of f118505, after successfully compiling ngsLCA on Ubuntu 20.04, I get a segmentation fault and empty output files when running it on one of the example files in bam_files:

See log below:

$ ./ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid -bam bam_files/SPL_015_1444.fq.plastids.sorted.bam -outnames SPL_015_1444
    -> Will output lca results in file:     'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> Will output lca weight in file:      'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
Segmentation fault (core dumped)
miwipe commented 2 years ago

Hi Maxime, ngsLCA only accepts nucl_gb.accession2taxid in gz-compressed format, but takes names and nodes in either compressed or uncompressed format. Compressing nucl_gb.accession2taxid should fix this issue; I have updated the README accordingly. I also encourage you to set the editdist or simscore options, as the default setting allows up to 10 mismatches.
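
As a minimal sketch of the advice above, the accession-to-taxid table can be gzip-compressed and checked before it is passed to `-acc2tax`. The helper name `compress_acc2tax` is hypothetical and not part of ngsLCA; the `-editdistmin`/`-editdistmax` flags in the commented command are the ones used later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: gzip-compress a file (keeping the original) and
# verify the archive. Prints the path of the .gz file on success.
compress_acc2tax() {
    src=$1
    # -k keeps the uncompressed file, -f overwrites an existing .gz
    gzip -kf "$src" || return 1
    # -t tests the integrity of the compressed file
    gzip -t "$src.gz" || return 1
    printf '%s\n' "$src.gz"
}

# Example use (paths taken from the commands in this thread):
#   acc2tax=$(compress_acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid)
#   ./ngsLCA -editdistmin 0 -editdistmax 0 \
#       -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp \
#       -acc2tax "$acc2tax" -bam sort.bam -outnames SPL_015_1444
```

Keeping the uncompressed copy makes it easy to retry with a different compressor if the first attempt does not help.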

maxibor commented 2 years ago

Still no luck with the gzip-compressed nucl_gb.accession2taxid.gz 🤔

$ ./ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam bam_files/SPL_015_1444.fq.plastids.bam -outnames SPL_015_1444
    -> Will output lca results in file:     'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> Will output lca weight in file:      'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
Segmentation fault (core dumped)

And the same error when trying with compressed nodes.dmp and names.dmp:

$ ./ngsLCA -editdistmin 0 -editdistmax 0 -names ncbi_tax_dmp/names.dmp.gz -nodes ncbi_tax_dmp/nodes.dmp.gz -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam bam_files/SPL_015_1444.fq.plastids.bam -outnames SPL_015_1444
    -> Will output lca results in file:     'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> Will output lca weight in file:      'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
Segmentation fault (core dumped)
maxibor commented 2 years ago

Dear @miwipe and @wyc661217, I'm afraid that the issue persists on the master branch as of today. Worse, after testing on three different computers with three different systems, I can confirm that I always encounter the same segmentation fault, whether I compile ngsLCA following the README's instructions or use the conda recipe :( I tried with both compressed and uncompressed nodes and names files (gzip or bgzip), and with gzip- or bgzip-compressed nucl_gb.accession2taxid, but unfortunately to no avail.

ANGSD commented 2 years ago

OK Maxime, can you let me know the full error output and all your commands? Then I am sure we can fix this.

Best Thorfinn
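
One way to collect the full output and the exact command in a single file is sketched below. The helper name `run_logged` and the log file name are hypothetical. Note that the "Segmentation fault (core dumped)" message is printed by the shell rather than by the program, so the sketch records the exit status as well.

```shell
#!/bin/sh
# Hypothetical helper: run a command, save its stdout and stderr to a log
# file, and append the exit status. A process killed by SIGSEGV is usually
# reported by the shell as status 139 (128 + signal 11).
run_logged() {
    log=$1
    shift
    # Record the command line first, then everything the command prints
    printf '$ %s\n' "$*" > "$log"
    "$@" >> "$log" 2>&1
    status=$?
    printf 'exit status: %s\n' "$status" >> "$log"
    return "$status"
}

# Example use (command taken from this thread):
#   run_logged ngslca_run.log ./ngsLCA -names ncbi_tax_dmp/names.dmp \
#       -nodes ncbi_tax_dmp/nodes.dmp \
#       -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz \
#       -bam sort.bam -outnames SPL_015_1444
```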

maxibor commented 2 years ago

The full command output is above @ANGSD :)

ANGSD commented 2 years ago

OK, great. Could you also supply the input files so I can try to reproduce?

maxibor commented 2 years ago

This is one of the files you provide on your README: https://sid.erda.dk/share_redirect/AqTXI6E560/SPL_015_1444.fq.plastids.bam

The smallest one I've tested (which still needs to be sorted by read name) is SPL_153_7001, which you also provide at https://sid.erda.dk/share_redirect/AqTXI6E560

$ ngsLCA/ngsLCA -editdistmin 0 -editdistmax 0 -names ncbi_tax_dmp/names.dmp.gz -nodes ncbi_tax_dmp/nodes.dmp.gz -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam SPL_153_7001.fq.plastids.sorted.bam
    -> Will output lca results in file:     'outnames.lca'
    -> Will output lca weight in file:      'outnames.wlca'
    -> Will output log info (problems) in file: 'outnames.log'
    -> [thread1] Will read header
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
Segmentation fault (core dumped)
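
Since the example files have to be sorted by read name first, a quick check of the BAM header can rule out sort order as a cause. The following is a rough sketch; the helper name `sort_order` is hypothetical, and it only inspects the `SO:` tag of the `@HD` header line, not the reads themselves.

```shell
#!/bin/sh
# Hypothetical helper: read SAM header text on stdin and print the value of
# the SO: (sort order) tag from the @HD line, if there is one.
sort_order() {
    awk -F '\t' '/^@HD/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SO:/) { print substr($i, 4); exit }
    }'
}

# Example use (requires samtools):
#   samtools view -H SPL_153_7001.fq.plastids.bam | sort_order
#   samtools sort -n -o sort.bam SPL_153_7001.fq.plastids.bam
#   samtools view -H sort.bam | sort_order   # a name-sorted file is expected to report "queryname"
```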
ANGSD commented 2 years ago

Hmm, it works for me.

See this:

% ./ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam sort.bam -outnames SPL_015_1444
    -> Will output lca results in file:     'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> Will output lca weight in file:      'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322 
    -> -bam     sort.bam
    -> -names   ncbi_tax_dmp/names.dmp
    -> -nodes   ncbi_tax_dmp/nodes.dmp
    -> -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz
    -> -simscoreLow 0.000000
    -> -simscoreHigh    1.000000
    -> -editdistMin 0
    -> -editdistMax 10
    -> -outnames    SPL_015_1444
    -> -minmapq 0
    -> Starting to extract (acc->taxid) from binary file: 'ncbi_tax_dmp/nucl_gb.accession2taxid.gz'
    -> Checking if exits: 'sort.bamsort.bam.bin'
check if exists: 0 
    -> opening file: 'sort.bamsort.bam.bin' mode: 'rb'
    -> Setting threads to: 4 
    -> Number of entries to use from accesion to taxid: 0, time taken: 0.00 sec
    -> [ncbi_tax_dmp/names.dmp] Number of unique names (column1): 2431432 with third column 'scientific name'
    -> Number of unique names (column1): 2431432 from file: ncbi_tax_dmp/nodes.dmp
    -> Will add some fixes of the ncbi database due to merged names
[hts]   -> editMin:0 editmMax:10 scoreLow:0.000000 scoreHigh:1.000000 minlength:-1 discard: 516
    -> Number of species with reads that map uniquely: 0
    -> [ALL done] walltime used =  4.00 sec

Could you try renaming the BAM file? The program builds some temporary files, and these may be corrupt.
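
An alternative to renaming the BAM file is to move any leftover cache files aside so that they are rebuilt on the next run. This is only a sketch: the helper name `stash_bin_cache` and the `stale_bin` directory are hypothetical, and the `*<bam>.bin` naming pattern is an assumption inferred from the "Checking if exits" log lines in this thread, not from the ngsLCA source.

```shell
#!/bin/sh
# Hypothetical helper: move cached "*<bam>.bin" files out of the way so the
# next run has to rebuild them. The naming pattern is an assumption taken
# from the log output in this thread.
stash_bin_cache() {
    bam_name=$(basename "$1")
    dir=${2:-.}
    mkdir -p "$dir/stale_bin" || return 1
    for f in "$dir"/*"$bam_name".bin; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$f" ] || continue
        mv "$f" "$dir/stale_bin/" && printf 'moved %s\n' "$f"
    done
}

# Example use, run from the directory where ngsLCA was started:
#   stash_bin_cache sort.bam
```

Moving rather than deleting keeps the old files available in case they are needed for debugging.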

maxibor commented 2 years ago

Still no luck @ANGSD :(

$ wget https://sid.erda.dk/share_redirect/AqTXI6E560/SPL_015_1444.fq.plastids.bam
$ samtools sort -n -@ 8 SPL_015_1444.fq.plastids.bam > sort.bam
$ ./ngsLCA/ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam sort.bam -outnames SPL_015_1444
    -> Will output lca results in file:     'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> Will output lca weight in file:      'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
Segmentation fault (core dumped)
$ head ncbi_tax_dmp/names.dmp
1   |   all |       |   synonym |
1   |   root    |       |   scientific name |
2   |   Bacteria    |   Bacteria <bacteria> |   scientific name |
2   |   bacteria    |       |   blast name  |
2   |   eubacteria  |       |   genbank common name |
2   |   Monera  |   Monera <bacteria>   |   in-part |
2   |   Procaryotae |   Procaryotae <bacteria>  |   in-part |
2   |   Prokaryotae |   Prokaryotae <bacteria>  |   in-part |
2   |   Prokaryota  |   Prokaryota <bacteria>   |   in-part |
2   |   prokaryote  |   prokaryote <bacteria>   |   in-part |
$ head ncbi_tax_dmp/nodes.dmp
1   |   1   |   no rank |       |   8   |   0   |   1   |   0   |   0   |   0   |   0|0 |       |       |       |   0   |   0   |   0   |
2   |   131567  |   superkingdom    |       |   0   |   0   |   11  |   0   |   0   |   0   |0| 0   |       |       |       |   0   |   0   |   1   |
6   |   335928  |   genus   |       |   0   |   1   |   11  |   1   |   0   |   1   |   0|0 |       |       |       |   0   |   0   |   1   |
7   |   6   |   species |   AC  |   0   |   1   |   11  |   1   |   0   |   1   |   1|0 |       |       |       |   1   |   0   |   1   |
9   |   32199   |   species |   BA  |   0   |   1   |   11  |   1   |   0   |   1   |   1|0 |       |       |       |   1   |   0   |   1   |
10  |   1706371 |   genus   |       |   0   |   1   |   11  |   1   |   0   |   1   |   0|0 |       |       |       |   0   |   0   |   1   |
11  |   1707    |   species |   CG  |   0   |   1   |   11  |   1   |   0   |   1   |   1|0 |   effective current name; |       |       |   1   |   0   |   1   |
13  |   203488  |   genus   |       |   0   |   1   |   11  |   1   |   0   |   1   |   0|0 |       |       |       |   0   |   0   |   1   |
14  |   13  |   species |   DT  |   0   |   1   |   11  |   1   |   0   |   1   |   1|0 |       |       |       |   1   |   0   |   1   |
16  |   32011   |   genus   |       |   0   |   1   |   11  |   1   |   0   |   1   |   0|0 |       |       |       |   0   |   0   |   1   |
$ zcat ncbi_tax_dmp/nucl_gb.accession2taxid.gz | head
accession   accession.version   taxid   gi
A00001  A00001.1    10641   58418
A00002  A00002.1    9913    2
A00003  A00003.1    9913    3
A00004  A00004.1    32630   57971
A00005  A00005.1    32630   57972
A00006  A00006.1    32630   57973
A00008  A00008.1    32630   57974
A00009  A00009.1    32630   57975
A00010  A00010.1    32630   57976

System info (though I also tested on my Mac and on a GitPod machine and obtained the same result):


$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:    20.04
Codename:   focal
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping:                        7
CPU MHz:                         1000.250
CPU max MHz:                     3900.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        4600.00
Virtualization:                  VT-x
L1d cache:                       1 MiB
L1i cache:                       1 MiB
L2 cache:                        32 MiB
L3 cache:                        44 MiB
NUMA node0 CPU(s):               0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62
NUMA node1 CPU(s):               1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
maxibor commented 2 years ago

Config of the gitpod instance:

$ lscpu 
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7B13
Stepping:                        0
CPU MHz:                         2450.000
BogoMIPS:                        4900.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        4 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid fsrm
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       foca
ANGSD commented 2 years ago

Dear Maxime, I have made some commits over the last few days; none of them should change results or functionality. They are changes to the makefile, and the program now cleans up before it exits.

It now also prints the git commit and the htslib version.

Can you please compare the output below with yours, including the md5 checksums?

@. ngsLCA]$ md5sum ncbi_tax_dmp/names.dmp ncbi_tax_dmp/nodes.dmp ncbi_tax_dmp/nucl_gb.accession2taxid.gz .bam
2c471a8c8fa65a3ce8032287aad09dd6  ncbi_tax_dmp/names.dmp
f11b659e48757df83100cd38e146263f  ncbi_tax_dmp/nodes.dmp
00349a4ba80cb8f5e602962337fdbe5a  ncbi_tax_dmp/nucl_gb.accession2taxid.gz
f075a5c1b0bb2ba8d197a65977b6d8f4  SPL_015_1444.fq.plastids.bam
ce0b3936eced33c291ce37258582e160  sort.bam
@. ngsLCA]$ make clean;make
Crypto library is available to link; adding -lcrypto to LIBS
HTSSRC not defined, assuming systemwide installation -lhts
rm -f ngsLCA .o .d version.h
Crypto library is available to link; adding -lcrypto to LIBS
HTSSRC not defined, assuming systemwide installation -lhts
echo '#define NGSLCA_VERSION "484abfc"' > version.h
g++ -c -O3 -std=c++11 ngsLCA.cpp
g++ -MM -O3 -std=c++11 ngsLCA.cpp >ngsLCA.d
g++ -c -O3 -std=c++11 ngsLCA_format.cpp
g++ -MM -O3 -std=c++11 ngsLCA_format.cpp >ngsLCA_format.d
g++ -c -O3 -std=c++11 ngsLCA_cli.cpp
g++ -MM -O3 -std=c++11 ngsLCA_cli.cpp >ngsLCA_cli.d
g++ -O3 -std=c++11 -o ngsLCA .o -lz -lm -lbz2 -llzma -lpthread -lcurl -lcrypto -lhts
@. ngsLCA]$ ./ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam sort.bam -outnames SPL_015_1444
    -> ngslca version: 484abfc (htslib: 1.15.1-45-g506f479) build(Jul 19 2022 16:51:16)
    -> Will output lca results in file: 'SPL_015_1444.lca'
    -> [thread1] Will read header
    -> [thread1] Done reading header: 0.00 sec, header contains: 4322
    -> Will output lca weight in file: 'SPL_015_1444.wlca'
    -> Will output log info (problems) in file: 'SPL_015_1444.log'
    -> -bam sort.bam
    -> -names ncbi_tax_dmp/names.dmp
    -> -nodes ncbi_tax_dmp/nodes.dmp
    -> -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz
    -> -simscoreLow 0.000000
    -> -simscoreHigh 1.000000
    -> -editdistMin 0
    -> -editdistMax 10
    -> -outnames SPL_015_1444
    -> -minmapq 0
    -> Starting to extract (acc->taxid) from binary file: 'ncbi_tax_dmp/nucl_gb.accession2taxid.gz'
    -> Checking if exits: 'nucl_gb.accession2taxid.gzsort.bam.bin'
check if exists: 0
    -> opening file: 'nucl_gb.accession2taxid.gzsort.bam.bin' mode: 'rb'
    -> Setting threads to: 4
    -> Number of entries to use from accesion to taxid: 4313, time taken: 0.00 sec
    -> [ncbi_tax_dmp/names.dmp] Number of unique names (column1): 2432197 with third column 'scientific name'
    -> Number of unique names (column1): 2432197 from file: ncbi_tax_dmp/nodes.dmp
    -> Will add some fixes of the ncbi database due to merged names
[hts]   -> editMin:0 editmMax:10 scoreLow:0.000000 scoreHigh:1.000000 minlength:-1 discard: 516
    -> Number of species with reads that map uniquely: 139
    -> [ALL done] walltime used = 19.00 sec
@. ngsLCA]$
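
The checksum comparison requested here could be scripted roughly as follows. The helper name `check_md5` is hypothetical, and the sketch assumes GNU `md5sum` is available. The taxonomy dumps and a locally re-sorted BAM may differ from the maintainer's copies for harmless reasons, so the downloaded BAM and the accession2taxid file are probably the most informative ones to compare.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE has the expected MD5 checksum.
# Assumes GNU md5sum; on macOS, `md5 -q FILE` prints the bare digest instead.
check_md5() {
    expected=$1
    file=$2
    actual=$(md5sum "$file" | cut -d ' ' -f 1) || return 1
    if [ "$actual" = "$expected" ]; then
        printf 'OK       %s\n' "$file"
    else
        printf 'MISMATCH %s (got %s)\n' "$file" "$actual"
        return 1
    fi
}

# Example use with two of the checksums posted above:
#   check_md5 f075a5c1b0bb2ba8d197a65977b6d8f4 SPL_015_1444.fq.plastids.bam
#   check_md5 00349a4ba80cb8f5e602962337fdbe5a ncbi_tax_dmp/nucl_gb.accession2taxid.gz
```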

maxibor commented 2 years ago

@ANGSD It works :)

$ ./ngsLCA/ngsLCA -names ncbi_tax_dmp/names.dmp -nodes ncbi_tax_dmp/nodes.dmp -acc2tax ncbi_tax_dmp/nucl_gb.accession2taxid.gz -bam sorted.bam -outnames SPL_015_1444
        -> ngslca version: 484abfc (htslib: 1.15.1-51-g1fba06c) build(Jul 20 2022 11:26:49)
        -> Will output lca results in file:             'SPL_015_1444.lca'
        -> [thread1] Will read header
        -> Will output lca weight in file:              'SPL_015_1444.wlca'
        -> Will output log info (problems) in file:     'SPL_015_1444.log'
        -> [thread1] Done reading header: 0.00 sec, header contains: 4322 
        -> -bam         sorted.bam
        -> -names       ncbi_tax_dmp/names.dmp
        -> -nodes       ncbi_tax_dmp/nodes.dmp
        -> -acc2tax     ncbi_tax_dmp/nucl_gb.accession2taxid.gz
        -> -simscoreLow 0.000000
        -> -simscoreHigh        1.000000
        -> -editdistMin 0
        -> -editdistMax 10
        -> -outnames    SPL_015_1444
        -> -minmapq     0
        -> Starting to extract (acc->taxid) from binary file: 'ncbi_tax_dmp/nucl_gb.accession2taxid.gz'
        -> Checking if exits: 'nucl_gb.accession2taxid.gzsorted.bam.bin'
check if exists: 1 
        -> opening file: 'nucl_gb.accession2taxid.gzsorted.bam.bin' mode: 'wb'
        -> Setting threads to: 4 
        -> opening file: 'ncbi_tax_dmp/nucl_gb.accession2taxid.gz' mode: 'rb'
        -> Setting threads to: 2 
        -> At linenr: 304100001 in 'ncbi_tax_dmp/nucl_gb.accession2taxid.gz'            -> Number of entries to use from accesion to taxid: 4313, time taken: 65.00 sec
        -> [ncbi_tax_dmp/names.dmp] Number of unique names (column1): 2432161 with third column 'scientific name'
        -> Number of unique names (column1): 2432161 from file: ncbi_tax_dmp/nodes.dmp
        -> Will add some fixes of the ncbi database due to merged names
[hts]   -> editMin:0 editmMax:10 scoreLow:0.000000 scoreHigh:1.000000 minlength:-1 discard: 516
        -> Number of species with reads that map uniquely: 139
        -> [ALL done] walltime used =  85.00 sec
ANGSD commented 2 years ago

That's great.

Best regards
