Closed ghuls closed 3 years ago
Hi @ghuls ,
We plan to address this soon: we will support both print as a statement and as a function (the lack of print statement is something that I personally really hate in Python 3, so it will be supported as well).
argparse is certainly better, and is definitely planned for later releases. Currently we don't have enough manpower to implement that sooner (porting getopt was trivial, but argparse does a lots of runtime typing tricks that make porting it harder).
@inumanag Seq seems like an interesting language.
Is there a better way (than Kmers in a list) to check if a certain read matches the big list of Kmers with hamming distance of 1 or less?
The current version seems quite slow (27 seconds for the first 10000 reads).
Is there also a way to indicate failure for a function (e.g.: return None) or do you need to create a specific type?
# File with 737280 16-kmers.
bc_10x_sc_atac_whitelist_filename = '737K-cratac-v1.txt'
fastq_filename = 'test.fastq.gz'
def read_barcode_whitelist_from_file[K](barcode_whitelist_filename: str):
bc_whitelist = list[Kmer[16]]()
with open(barcode_whitelist_filename, 'r') as fh:
for bc_line in fh:
bc = Kmer[16](seq(bc_line.strip()))
bc_whitelist.append(bc)
return bc_whitelist
bc_whitelist = read_barcode_whitelist_from_file[Kmer[16]](bc_10x_sc_atac_whitelist_filename)
type bcMatch(matched: bool, mismatch_dist: int)
def bc_matches_whitelist(bc: Kmer[16], mismatch_dist: int):
min_hamming_dist = 16
for bc_in_whitelist in bc_whitelist:
hamming_dist = abs(bc_in_whitelist - bc)
if hamming_dist <= mismatch_dist:
return bcMatch(True, hamming_dist)
min_hamming_dist = min(min_hamming_dist, hamming_dist)
return bcMatch(False, min_hamming_dist)
nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0
for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
nbr_reads += 1
bc_match = bc_matches_whitelist(Kmer[16](record.seq), 1)
if bc_match.matched == True:
if bc_match.mismatch_dist == 0:
nbr_bc_0_mismatches += 1
else:
nbr_bc_1_mismatches += 1
if nbr_reads == 10000:
break
print nbr_reads, nbr_bc_0_mismatches, nbr_bc_1_mismatches
Hi @ghuls
1) We have optional[T]
type which can be either None
or some value (that you can access via ~var
). The next release will bring a None type more similar to Python's Optional[T]
.
2) How many barcodes do you have? Also, how are you compiling this program (what command do you use?)
seqc find_barcodes.seq
Also how can parallelism be added?
I assume gzipping (when reading a FASTQ file) happens in a separate thread.
Does seq allow running the read_barcode_whitelist_from_file function in parallel, and collect the results later?
Numba for example allows something like this with arrays: https://numba.readthedocs.io/en/stable/user/parallel.html
from numba import njit, prange
@njit(parallel=True)
def prange_test(A):
s = 0
# Without "parallel=True" in the jit-decorator
# the prange statement is equivalent to range
for i in prange(A.shape[0]):
s += A[i]
return s
Another thing I noticed is that if boolean_value is True:
is not supported (if boolean_value == True:
works fine, so not a big problem).
Hi,
Maybe a better approach is to flip this procedure; right now you iterate over (reads x barcodes), which is 737280 x 10000 > 7 billion Hamming distance calculations (maybe less by a factor of ~2 since you stop early).
Instead you could construct a set
of k-mers from the whitelist and iterate over Hamming neighbors, checking if each is in the set. This would be 10000 x (3x16 + 1) < 500k set lookups.
Here is an example of iterating over Hamming neighbors: https://docs.seq-lang.org/cookbook.html#k-mer-hamming-neighbors
In your case the code might look something like:
whitelist = { ... } # set of k-mers from whitelist
def bc_matches_whitelist(bc: Kmer[16]):
if bc in whitelist:
return 0 # exact match
for neighbor in neighbors(bc):
if neighbor in whitelist:
return 1 # Hamming neighbor in whitelist; distance = 1
return -1 # no match
Let me know if this makes sense!
P.S. Just if boolean_value:
is probably the most idiomatic way to express that condition.
Thanks a lot.
Now it only takes 45 seconds for the whole FASTQ file (15 milion reads). My original code was still running from this morning on the whole file.
I got the code parallelized, but I assumed it would run each FASTQ file in a separate thread (+ a thread for gunzipping the FASTQ file), but it seems to work slightly different.
As you can see in the end, setting OMP_NUM_THREADS
to different values gives almost the same wall time but user time multiplied (e.g. for 8 and 16). What are those 8 threads doing (as there are no more than 8 fastq files). With OMP_NUM_THREADS=4
the user time is quite comparable with OMP_NUM_THREADS=8
(even a bit shorted), but with OMP_NUM_THREADS=16
it is almost twice as long while the wall time is the same.
Can the number of threads be controlled from seq
itself (e.g. depending on the number of fastq files)?
from sys import argv
bc_10x_sc_atac_whitelist_filename = '737K-cratac-v1.txt'
def read_barcode_whitelist_from_file[K](bc_whitelist_filename: str):
"""
Read whitelisted barcodes from file and convert to a set of Kmers.
"""
bc_whitelist = set[K]()
with open(bc_whitelist_filename, 'r') as fh_bc:
for bc_line in fh_bc:
bc_str = bc_line.strip()
if len(bc_str) == 0 or bc_str[0] == '#':
# Skip empty lines and lines that start with a comment.
continue
bc = K(seq(bc_str))
bc_whitelist.add(bc)
return bc_whitelist
bc_whitelist = read_barcode_whitelist_from_file[Kmer[16]](bc_10x_sc_atac_whitelist_filename)
def neighbors(kmer):
"""
Create kmers with hamming distance of 1.
"""
for i in range(len(kmer)):
for b in (k'A', k'C', k'G', k'T'):
if kmer[i] != b:
yield kmer |> base(i, b)
def bc_matches_whitelist(bc: Kmer[16]) -> int:
if bc in bc_whitelist:
# Exact match.
return 0
for neighbor in neighbors(bc):
if neighbor in bc_whitelist:
# Hamming neighbor in whitelist (distance = 1).
return 1
# No match.
return -1
def find_bc_in_fastq(fastq_filename):
nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0
for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
nbr_reads += 1
bc_match = bc_matches_whitelist(Kmer[16](record.seq))
if bc_match == 0:
nbr_bc_0_mismatches += 1
elif bc_match == 1:
nbr_bc_1_mismatches += 1
#if nbr_reads == 100000:
# break
print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}\nnbr_bc_1_mismatches\t{nbr_bc_1_mismatches}\n'
if len(argv) == 1:
print 'Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]'
else:
iter(argv[1:]) ||> find_bc_in_fastq
Running it like this:
time OMP_NUM_THREADS=4 LD_LIBRARY_PATH=/software/seq/lib/seq seqc 10X_scATAC.seq ATAC_*_R2_001.fastq.gz
ATAC_S26_L002_R2_001.fastq.gz:
nbr_reads: 15144411
total_bc_found 672535
nbr_bc_0_mismatches 158421
nbr_bc_1_mismatches 514114
ATAC_S23_L002_R2_001.fastq.gz:
nbr_reads: 15380468
total_bc_found 696047
nbr_bc_0_mismatches 174185
nbr_bc_1_mismatches 521862
ATAC_S23_L001_R2_001.fastq.gz:
nbr_reads: 15443856
total_bc_found 700114
nbr_bc_0_mismatches 175553
nbr_bc_1_mismatches 524561
ATAC_S24_L001_R2_001.fastq.gz:
nbr_reads: 21842054
total_bc_found 953816
nbr_bc_0_mismatches 215077
nbr_bc_1_mismatches 738739
ATAC_S26_L001_R2_001.fastq.gz:
nbr_reads: 15176469
total_bc_found 674257
nbr_bc_0_mismatches 159440
nbr_bc_1_mismatches 514817
ATAC_S25_L001_R2_001.fastq.gz:
nbr_reads: 16592075
total_bc_found 742674
nbr_bc_0_mismatches 181133
nbr_bc_1_mismatches 561541
ATAC_S24_L002_R2_001.fastq.gz:
nbr_reads: 21818614
total_bc_found 950152
nbr_bc_0_mismatches 213626
nbr_bc_1_mismatches 736526
ATAC_S25_L002_R2_001.fastq.gz:
nbr_reads: 16543558
total_bc_found 741736
nbr_bc_0_mismatches 180998
nbr_bc_1_mismatches 560738
# With OMP_NUM_THREADS=4
real 1m52.256s
user 7m28.170s
sys 0m5.939s
# With OMP_NUM_THREADS=6
real 1m35.313s
user 9m16.558s
sys 0m7.383s
# With OMP_NUM_THREADS=8
real 1m9.729s
user 8m49.437s
sys 0m6.365s
# With OMP_NUM_THREADS=16
real 1m9.140s
user 17m6.616s
sys 0m12.671s
# With OMP_NUM_THREADS=36
real 1m15.074s
user 40m47.396s
sys 0m21.305s
There's no simple way right now to set the number of threads programmatically, although it can technically be done by calling the omp_set_num_threads
OpenMP C function:
cimport omp_set_num_threads(i32)
omp_set_num_threads(i32(num_threads)) # num_threads is an int
The code you posted definitely looks reasonable to me, and I'd expect it to be amenable to parallelization. How many cores/threads does the machine you're running on have?
@ghuls : Thanks for spotting this!
Two things:
argv
, and based on your example, you only have 8 incoming files, so 16 threads won't be speeding up anything. You can also parallelize bc_matches_whitelist
(you can nest parallel pipes, and we have blocks
function that partitions the FASTQ in a large enough chunks exactly for this purpose). However, it should not be slower w/ 16 threads... The machine has 36 cores, so if I don't specify OMP_NUM_THREADS
, it will run with 36 (and it will be slower than with 8 and use 40 minutes of user time instead of 9 minutes. I assumed seq would figure out it should only use 8 threads.
How would I parallelize bc_matches_whitelist
in a useful way?
The I/0 should be pretty fast. The node I am using uses data from GPFS over infiniband.
How to use define nbr_reads
, nbr_bc_0_mismatches
and nbr_bc_1_mismatches
variable per fastq file (similar to global, but only for a specific function (find_bc_in_fastq
))?
I also assume that just defining a variable global is not thread safe (e.g. when modifying it inside process
). What is the best way to ensure thread safety for global or per function variables when being written from from multiple threads? @atomic
, lock or read lock? Can @atomic
only be used on a function or also for just one line? Should @atomic
just be used for one variable a the time, or is it ok to use for updating multiple variables?
nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0
def find_bc_in_fastq(fastq_filename: str):
# nbr_reads = 0
# nbr_bc_0_mismatches = 0
# nbr_bc_1_mismatches = 0
global nbr_reads
global nbr_bc_0_mismatches
global nbr_bc_1_mismatches
#for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
def process(record: FASTQRecord):
@atomic
def update_counters(bc_match: int):
global nbr_reads
global nbr_bc_0_mismatches
global nbr_bc_1_mismatches
nbr_reads += 1
if bc_match == 0:
nbr_bc_0_mismatches += 1
elif bc_match == 1:
nbr_bc_1_mismatches += 1
bc_match = bc_matches_whitelist(Kmer[16](record.seq))
update_counters(bc_match)
#if nbr_reads == 100000:
# break
FASTQ(fastq_filename, gzip=True, validate=False, copy=True) |> blocks(size=1000) ||> iter |> process
print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}$
if len(argv) == 1:
stderr.write('Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]\n')
exit(1)
else:
iter(argv[1:]) ||> find_bc_in_fastq
Thanks.
I just played a bit around with parsing of a BAM file and noticed that the example is lacking some info: https://docs.seq-lang.org/cookbook.html#reading-sam-bam-cram
BAM(bam_file).contig[r.tid]
to get the contig name.r.AUX(tag_name)
and will result in a segfault. r.AUX(tag_name).i
and r.AUX(tag_name).f
works.Also is there a way to get the type from a tag, so the right function can be called on the tag?
Can bam_endpos
be exposed in c_htslib (and in bam
) as it is very useful to get the genomic end position of a read:
https://github.com/samtools/htslib/blob/19c189438f852e6e62dbda73f854d465cebb3d9f/sam.c#L343-L349
$ samtools view sample.bam | head -n 3
NB501171:585:HLM2TBGXG:1:22108:8631:16195 0 chr1 3005850 255 91M * 0 0 CCAATCCAGCAGTGGCACCTGTGCACAACAACAAACAATAGCTTGACACAATAACCCTAAGGGTATACTAGAGACATGCAGTAACTATGAT AAA/AAEE6EAEE//E//<E//E/E//E//E</EE/A/EA/<//E///</E/EAE////A///EE<</E/////EE//<<6///<<E///6 NH:i:1 HI:i:1 AS:i:77 nM:i:6 CR:Z:ACGACATTAA_AAGGTGTGGT_TACGACGATG CY:Z:A//6AA//EA_EE//E//A/E_AE/EEAE<<6 UR:Z:CCCGGGGGCA UY:Z:AAAAAAAAAA
NB501171:585:HLM2TBGXG:3:21410:20944:11818 0 chr1 3005850 255 91M * 0 0 CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT AAAAAEEAEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEAEAEEAAEEAEEAA/EEEEEEEEEAAEAE6EE<E<AEAA//E<EE/E/ NH:i:1 HI:i:1 AS:i:89 nM:i:0 CR:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG CY:Z:AAAAAEEEEE_EEEEEEEEEE_EEEAEE<EEE UR:Z:CAAGACGCGC UY:Z:AAAAAAAAAA CB:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG UB:Z:CAAGACGCGC
NB501171:585:HLM2TBGXG:3:22408:2332:9008 0 chr1 3005850 255 91M * 0 0 CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT AAAAAEEEEEEEEEEE/EEEEEEEEAE/EEEEEA/EEEEEEE<<E/EEEEEEE/EA<AEEEEAEEEEEEEEEEEE/EEE/<AEAAEE<AEE NH:i:1 HI:i:1 AS:i:89 nM:i:0 CR:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG CY:Z:AAAAAEEEEE_EEEE<EEEEE_EEEAEEEEE< UR:Z:CGCACAGGGC UY:Z:AAAAAAAAAA CB:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG UB:Z:CGCACAGGGC
bam_filename = 'sample.bam'
nbr_reads = 0
# Assign BAM() to bam, so we can get the contig later.
bam = BAM(bam_filename)
# Does not work if we want to print the contig name.
# for r in BAM(bam_filename):
for r in bam:
print r.name
print r.tid
print bam.contig(r.tid)
print r.locus
print r.pos
print r.read
print r.mapq
print r.cigar
print r.read1
print r.read2
print r.duplicate
print r.reversed
print r.aux('CR')
# Segfault as AS is an integer
#print r.aux('AS')
# Works like this.
print r.aux('AS').i
# Segfaults if CB tag is not available for the current read.
print r.aux('CB')
# Works like this
cb = r.aux('CB')
if cb:
print f"CB {cb.Z}"
# etc.
print
nbr_reads += 1
if nbr_reads == 3:
break;
seqc read_bam.seq
NB501171:585:HLM2TBGXG:1:22108:8631:16195
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCAGTGGCACCTGTGCACAACAACAAACAATAGCTTGACACAATAACCCTAAGGGTATACTAGAGACATGCAGTAACTATGAT
255
91M
False
False
False
False
ACGACATTAA_AAGGTGTGGT_TACGACGATG
77
NB501171:585:HLM2TBGXG:3:21410:20944:11818
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT
255
91M
False
False
False
False
AGGACATGAA_AAGGTGTGGT_TACGACGATG
89
CB AGGACATGAA_AAGGTGTGGT_TACGACGATG
NB501171:585:HLM2TBGXG:3:22408:2332:9008
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT
255
91M
False
False
False
False
AGGACATGAA_AAGGTGTGGT_TACGACGATG
89
CB AGGACATGAA_AAGGTGTGGT_TACGACGATG
Most of these are quirks of htslib, which we use for parsing BAM/SAM. But,
SAMRecord
upon its creation to avoid having to do this lookup.Storing the contig name in the SAMRecord is not necessary for me.
I figured out a way to get type info from tags in htslib (see here to see how to convert the one byte type char to the correct type): https://github.com/samtools/htslib/blob/2264113e5df1946210828e45d29c605915bd3733/sam.c#L3648-L3726
Basically get a pointer to the result returned by: bam_aux_get(r.__raw__(), ptr[byte](__ptr__(tag_arr)))
and match that byte to get the correct type:
from bio.c_htslib import bam_endpos, bam_aux_get
bam_reader = BAM(bam_filename)
for r in bam_reader:
# Only use reads with are not marked as duplicate.
if not r.unmapped and not r.duplicate:
# Get cell barcode tag.
cb = r.aux('CB')
# Only use reads which have cell barcode tag.
if cb:
print "\n" + r.name
print(f'{bam_reader.contig(r.tid)}\t{r.pos}\t{int(bam_endpos(bam_reader.__raw__()))}\t{cb.Z}')
print 'CB:Z', ptr[byte](cb.s)[0]
print 'AS:i', ptr[byte](r.aux('AS').s)[0]
print 'HI:i', ptr[byte](r.aux('AS').s)[0]
Output: Z = str
, C = u8
NB501171:598:HVLM7BGXG:1:21305:5934:15468
chr2L 5062 5112 TTGTCTGACGACCGTCCGTCTCAACGTCAC
CB:Z Z
AS:i C
HI:i C
NB501171:598:HVLM7BGXG:2:21106:24765:9141
chr2L 5062 5112 AAGTAAGCACTAACCACCGCCGTTGTCAGC
CB:Z Z
AS:i C
HI:i C
NB501171:598:HVLM7BGXG:3:11511:7045:12139
chr2L 5062 5112 TAACACGAAGCTGATATTCATCTTCCAACT
CB:Z Z
AS:i C
HI:i C
Could you also add bam_endpos
function from HTSlib as it allows to calculate the most right end location based on the CIGAR string.
diff --git a/stdlib/bio/c_htslib.seq b/stdlib/bio/c_htslib.seq
index ea8b9a4..d8dd9c9 100644
--- a/stdlib/bio/c_htslib.seq
+++ b/stdlib/bio/c_htslib.seq
@@ -17,6 +17,7 @@ from LD cimport bam_read1(cobj, cobj) -> i32
from LD cimport bam_init1() -> cobj
from LD cimport bam_cigar2qlen(int, ptr[u32]) -> int
from LD cimport bam_cigar2rlen(int, ptr[u32]) -> int
+from LD cimport bam_endpos(cobj) -> i32
from LD cimport bam_aux_get(cobj, ptr[byte]) -> ptr[u8]
from LD cimport bam_aux2i(ptr[u8]) -> int
from LD cimport bam_aux2f(ptr[u8]) -> float
Or allow it to be calculated from a SAM record:
hts_pos_t bam_endpos(const bam1_t *b)
{
hts_pos_t rlen = (b->core.flag & BAM_FUNMAP)? 0 : bam_cigar2rlen(b->core.n_cigar, bam_get_cigar(b));
if (rlen == 0) rlen = 1;
return b->core.pos + rlen;
}
Ah, good to know! This should be very easy to add then.
bam_endpos
can also be added -- we'll include these in the next release.
Just pushed some additions related to your post above:
SAMRecord.endpos
, which calls HTSlib's bam_endpos
SAMAux.tag_type
, which returns a string representing the tag's type (e.g. 'int'
, 'float'
, 'array[int]'
etc.)SAMAux
, so it should no longer segfaultThese will all be included in the next release!
This comment should have B2f(index)
instead of B2i(index)
:
https://github.com/seq-lang/seq/blob/ee3d3d4e615c355d1091e76a49ebfe1a0d6065dd/stdlib/bio/bam.seq#L167
print
function landed in #132. print
statement is still supported as print
(note the space).
@inumanag This is not fixed yet:
This comment should have
B2f(index)
instead ofB2i(index)
:
Fixed now on #132 . Thanks.
Use python3 syntax: eg. for print. Now
print('a')
also prints(a)
instead of justa
.Also https://github.com/seq-lang/seq/blob/develop/stdlib/getopt.seq is from python 2.7 instead of python 3. argparse is a bit nice to use: https://docs.python.org/3/library/argparse.html