popgenmethods / SINGER

Sampling and inference of genealogies with recombination
MIT License

Incomplete File Output with SINGER #4

Closed santiago1234 closed 7 months ago

santiago1234 commented 8 months ago

Hi @YunDeng98 ,

I'm experiencing an issue with SINGER version singer-0.1.7-beta-linux-x86_64 while processing some human genomes from chromosome 22, specifically using the parallel execution feature. Below is the command I use to run SINGER:

parallel_singer -vcf {params.vcf} \
           -Ne 1e4 \
           -m 1.2e-8 \
           -output results/trees/mex_chr22 \
           -n 1000 \
           -thin 1

The process appears to proceed without issues until the execution reaches the last step, which involves the script convert_long_ARG.py.

Upon reaching this step, the following error is encountered:

Traceback (most recent call last):
  File "/data/users/smedina/projects/MxGentreesDemography/bin/singer-0.1.7-beta-linux-x86_64/convert_long_ARG.py", line 118, in <module>
    main()
  File "/data/users/smedina/projects/MxGentreesDemography/bin/singer-0.1.7-beta-linux-x86_64/convert_long_ARG.py", line 114, in main
    ts = read_long_ARG(node_files, branch_files, mutation_files, block_coordinates)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/smedina/projects/MxGentreesDemography/bin/singer-0.1.7-beta-linux-x86_64/convert_long_ARG.py", line 23, in read_long_ARG
    node_time = np.loadtxt(node_file)
                ^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/smedina/mambaforge/envs/coaldecoder/lib/python3.12/site-packages/numpy/lib/npyio.py", line 1373, in loadtxt
    arr = _read(fname, dtype=dtype, comment=comment, delimiter=delimiter,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/smedina/mambaforge/envs/coaldecoder/lib/python3.12/site-packages/numpy/lib/npyio.py", line 992, in _read
    fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/smedina/mambaforge/envs/coaldecoder/lib/python3.12/site-packages/numpy/lib/_datasource.py", line 193, in open
    return ds.open(path, mode, encoding=encoding, newline=newline)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/smedina/mambaforge/envs/coaldecoder/lib/python3.12/site-packages/numpy/lib/_datasource.py", line 533, in open
    raise FileNotFoundError(f"{path} not found.")
FileNotFoundError: results/trees/mex_chr22_0_1_nodes_987.txt not found.
Processing segment 0

After inspecting the files with prefix: results/trees/mex_chr22_0_1_nodes

ls results/trees/mex_chr22_0_1_nodes*
# the result is
results/trees/mex_chr22_0_1_nodes_0.txt
...
results/trees/mex_chr22_0_1_nodes_142.txt

The file numbering only ranges from 0 to 142, whereas the program expects files numbered up to 999 (the error above asks for _987). However, upon checking files with other prefixes (e.g., results/trees/mex_chr22_30_31_nodes_*), I found that all the expected files are present.

Do you have insights or suggestions on what might be causing this issue?

Additionally, the README appears to be incomplete in this section. Could you provide an example of how SINGER was run for the African genomes discussed in the paper?

thanks,

YunDeng98 commented 8 months ago

Hi @santiago1234 , thanks a lot for reporting this!

(1) I will finish writing the documentation for parallel_singer and sorry for not completing that;

(2) By the way, I noticed that you used a thinning of 1, which might be too short for effective mixing. I would suggest using -thin 20 instead;

(3) Could you do me a favor and check the other prefixes, e.g. results/trees/mex_chr22_j_j+1_nodes_*.txt? Do they extend from 0 to 999, or do some of them also abort midway?

(4) In terms of African data, I used naive parallelization (I didn't have parallel_singer by then). I will give you one example of my job submissions:

#!/bin/bash
#SBATCH --account=co_songlab
#SBATCH --partition=savio2_htc
#SBATCH --time=16:00:00
#SBATCH --error=slurm_job_%j.err
#SBATCH --output=slurm_job_%j.out
#SBATCH --cpus-per-task=1

singer_master -Ne 2e4 -m 1.2e-8 -vcf ../african_chr1_cleaned -output ../singer_args/african_chr1_22_23 -start 22000000 -end 23000000 -n 2000 -thin 5

and I submitted many jobs like this, one per 1 Mb window.
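The naive per-window parallelization above can be sketched as a loop that emits one singer_master command per window. This is a minimal sketch, not the script actually used: the region length, window size, and file paths are placeholders carried over from the example, and `gen_commands` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper to print one singer_master command per 1 Mb window,
# mirroring the sbatch example above. CHR_LEN and the file paths are
# placeholders; adjust them to your chromosome and directory layout.
WINDOW=1000000
CHR_LEN=50000000

gen_commands() {
    local start end
    for (( start = 0; start < CHR_LEN; start += WINDOW )); do
        end=$(( start + WINDOW ))
        # Output suffix follows the african_chr1_22_23 naming: Mb indices.
        echo "singer_master -Ne 2e4 -m 1.2e-8 -vcf ../african_chr1_cleaned" \
             "-output ../singer_args/african_chr1_$(( start / WINDOW ))_$(( end / WINDOW ))" \
             "-start ${start} -end ${end} -n 2000 -thin 5"
    done
}

gen_commands
```

Each printed line can then be submitted as its own job, for example wrapped into a per-window sbatch script like the one above.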

YunDeng98 commented 8 months ago

Hi @santiago1234, sorry, I misunderstood your response. So you already checked all the other files, and only results/trees/mex_chr22_0_1_nodes_*.txt failed to extend from 0 to 999, while all the other prefixes did reach 999?

santiago1234 commented 8 months ago

Hi @YunDeng98, I counted the files, and this is what I get:

ls mex_chr22_*_nodes_*.txt | awk -F '_' '{print $3 "_" $4}' |  sort | uniq -c

  144 0_1
 1001 10_11
 1001 11_12
 1001 12_13
 1001 13_14
 1001 14_15
 1001 15_16
 1001 16_17
 1001 17_18
 1001 18_19
 1001 19_20
  258 1_2
 1001 20_21
 1001 21_22
 1001 22_23
 1001 23_24
 1001 24_25
 1001 25_26
 1001 26_27
 1001 27_28
 1001 28_29
 1001 29_30
  495 2_3
 1001 30_31
 1001 31_32
 1001 32_33
 1001 33_34
 1001 34_35
 1001 35_36
 1001 36_37
 1001 37_38
 1001 3_4
 1001 4_5
 1001 5_6
 1001 6_7
 1001 7_8
 1001 8_9
 1001 9_10

So just a few prefixes are incomplete: 0_1, 1_2, and 2_3.

YunDeng98 commented 8 months ago

Hi @santiago1234, I think I may know the reason. Could you check how many variant sites there are in the first 3 Mb?

santiago1234 commented 7 months ago

The first 3 Mb, which covers the region chr22:11211657-14211657, contains only 18 variants. I suspect there are so few variants because I used a mask to exclude low-quality regions.

YunDeng98 commented 7 months ago

Yes, that is exactly my guess. SINGER doesn't handle regions with only a handful of variants well and might abort, and it is not very meaningful to infer an ARG there anyway. I would recommend re-running SINGER starting from 14 Mb, or, if you prefer to keep your current results, I can guide you through converting the partial results into tskit format.

santiago1234 commented 7 months ago

Oh, I see. I don't mind starting from 14 Mb. Would you recommend excluding regions of the genome (for instance, 1 Mb windows) with fewer than X variants? That way, for other chromosomes, I wouldn't have to determine this start value beforehand.

thanks again

YunDeng98 commented 7 months ago

Right, just remove those 18 variants from the VCF and I think things will work from there. I will write a new parallel_singer that requires a minimum of X variants in a window before running SINGER on it. Thanks a lot for your feedback!
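The pre-filtering idea can be sketched in Python. This is a hypothetical pre-check, not part of SINGER or parallel_singer: `low_variant_windows` is an invented helper, it parses an uncompressed VCF line by line, and the window size and threshold are placeholders.

```python
# Sketch: count variant sites per fixed-size window in a plain-text VCF
# and report windows with fewer than `min_sites` variants, so those
# windows can be skipped before launching per-window SINGER jobs.
from collections import Counter

def low_variant_windows(vcf_lines, window=1_000_000, min_sites=100):
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        pos = int(line.split("\t")[1])  # POS is the second VCF column
        counts[pos // window] += 1
    # Windows with zero variants never appear in `counts` at all, so
    # they must be handled separately if the region can contain gaps.
    return sorted(w for w, n in counts.items() if n < min_sites)
```

Run on the decompressed VCF for a chromosome, this gives the window indices to exclude (or the first well-covered window to use as the -start value).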

santiago1234 commented 7 months ago

great, thanks!