Open subwaystation opened 1 year ago
[wfmash::map] Reference = [/data/HLA/DRB1-3123.fa.gz]
[wfmash::map] Query = [/data/HLA/DRB1-3123.fa.gz]
[wfmash::map] Kmer size = 19
[wfmash::map] Window size = 30
[wfmash::map] Segment length = 5000 (read split allowed)
[wfmash::map] Block length min = 25000
[wfmash::map] Chaining gap max = 100000
[wfmash::map] Percentage identity threshold = 70%
[wfmash::map] Skip self mappings
[wfmash::map] Mapping output file = /dev/stdout
[wfmash::map] Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
[wfmash::map] Execution threads = 16
[wfmash::skch::Sketch::build] minimizers picked from reference = 10970
[wfmash::skch::Sketch::index] unique minimizers = 2797
[wfmash::skch::Sketch::computeFreqHist] Frequency histogram of minimizers = (1, 19) ... (22, 1)
[wfmash::skch::Sketch::computeFreqHist] With threshold 0.001%, consider all minimizers during lookup.
[wfmash::map] time spent computing the reference index: 0.00974391 sec
[wfmash::skch::Map::mapQuery] mapped 0.00% @ 0.00e+00 bp/s elapsed: 00:00:00:00[wfmash::skch::Map::mapQuery] mapped 100.00% @ 3.26e+05 bp/s elapsed: 00:00:00:00 remain: 00:00:00:00
[wfmash::skch::Map::mapQuery] count of mapped reads = 11, reads qualified for mapping = 12, total input reads = 12, total input bp = 163416
[wfmash::map] time spent mapping the query: 5.02e-01 sec
[wfmash::map] mapping results saved in: /dev/stdout
wfmash -s 5000 -l 25000 -p 70 -n 11 -k 19 -H 0.001 -X -t 16 --tmp-base /data/HLA/outputtt /data/HLA/DRB1-3123.fa.gz --approx-map
0.09s user 0.02s system 24% cpu 0.51s total 63704Kb max memory
python3 /usr/local/bin/scripts/paf2net.py -p /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf
0.03s user 0.00s system 97% cpu 0.04s total 10636Kb max memory
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-x0sxurwi because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Detected 2 communities.
python3 /usr/local/bin/scripts/net2communities.py -e /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.edges.list.txt -w /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.edges.weights.txt -n /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.mappings.wfmash.paf.vertices.id2name.txt --accurate-detection --output-prefix /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e
0.83s user 2.05s system 519% cpu 0.55s total 72840Kb max memory
samtools faidx /data/HLA/DRB1-3123.fa.gz gi|568815592:32578768-32589835 gi|568815551:3814534-3830133 gi|568815561:3988942-4004531 gi|568815567:3779003-3792415 gi|568815569:3979127-3993865 gi|28212469:126036-137103 gi|28212470:131613-146345 gi|157702218:147985-163915 gi|528476637:32549024-32560088
0.00s user 0.00s system 100% cpu 0.00s total 3368Kb max memory
samtools faidx /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa
0.00s user 0.00s system 20% cpu 0.01s total 3192Kb max memory
samtools faidx /data/HLA/DRB1-3123.fa.gz gi|568815529:3998044-4011446 gi|345525392:5000-18402 gi|29124352:124254-137656
0.00s user 0.00s system 50% cpu 0.00s total 3484Kb max memory
samtools faidx /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa
0.00s user 0.00s system 28% cpu 0.00s total 3128Kb max memory
pggb -i /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa \
-o /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.0.fa.out \
-p 5000 -l 25000 -p 70 -n 12 -K 19 -F 0.001 \
-k 19 -f 0 -B 10000000 \
-H 12 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
--threads 16 --poa-threads 16
pggb -i /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa \
-o /data/HLA/outputtt/DRB1-3123.fa.gz.6deb21e.community.1.fa.out \
-p 5000 -l 25000 -p 70 -n 12 -K 19 -F 0.001 \
-k 19 -f 0 -B 10000000 \
-H 12 -j 0 -e 0 -G 700,900,1100 -P 1,19,39,3,81,1 -O 0.001 -d 100 -Q Consensus_ \
--threads 16 --poa-threads 16
Is this intended? I would expect 1 community.
With -p 98
several communities. But with -p 70
it should be one from my understanding.
That is a quite small (and short) dataset for applying community detection.
However, it doesn't seem like a bug. Those sequences are more similar to each other than all the others, so they form a community of their own. I suppose they tend to stay together also with different values of -p
.
I see. The longer the better. But how long? LPA? E. coli? Yeast? Should one already know, there are several chromosomes in the dataset? Can you somehow quantify when to apply the community detection?
Hi @AndreaGuarracino,
when I am running the default pggb with DRB1-3123 I do get one graph component aka one community:
However, running the
partition-before-pggb
script, I get 2 communites: