seq-lang / seq

A high-performance, Pythonic language for bioinformatics
https://seq-lang.org
Apache License 2.0
697 stars 50 forks source link

Use python3 syntax: eg. for print #153

Closed ghuls closed 3 years ago

ghuls commented 4 years ago

Use python3 syntax: eg. for print. Now print('a') also prints (a) instead of just a.

Also https://github.com/seq-lang/seq/blob/develop/stdlib/getopt.seq is from python 2.7 instead of python 3. argparse is a bit nice to use: https://docs.python.org/3/library/argparse.html

inumanag commented 4 years ago

Hi @ghuls ,

We plan to address this soon: we will support both print as a statement and as a function (the lack of print statement is something that I personally really hate in Python 3, so it will be supported as well).

argparse is certainly better, and is definitely planned for later releases. Currently we don't have enough manpower to implement that sooner (porting getopt was trivial, but argparse does a lots of runtime typing tricks that make porting it harder).

ghuls commented 4 years ago

@inumanag Seq seems like an interesting language.

Is there a better way (than Kmers in a list) to check if a certain read matches the big list of Kmers with hamming distance of 1 or less?

The current version seems quite slow (27 seconds for the first 10000 reads).

Is there also a way to indicate failure for a function (e.g.: return None) or do you need to create a specific type?

# File with 737280 16-kmers.
bc_10x_sc_atac_whitelist_filename = '737K-cratac-v1.txt'

fastq_filename = 'test.fastq.gz'

def read_barcode_whitelist_from_file[K](barcode_whitelist_filename: str):
    bc_whitelist = list[Kmer[16]]()

    with open(barcode_whitelist_filename, 'r') as fh:
        for bc_line in fh:
            bc = Kmer[16](seq(bc_line.strip()))
            bc_whitelist.append(bc)

    return bc_whitelist

bc_whitelist = read_barcode_whitelist_from_file[Kmer[16]](bc_10x_sc_atac_whitelist_filename)

type bcMatch(matched: bool, mismatch_dist: int)

def bc_matches_whitelist(bc: Kmer[16], mismatch_dist: int):
    min_hamming_dist = 16

    for bc_in_whitelist in bc_whitelist:
        hamming_dist = abs(bc_in_whitelist - bc)

        if hamming_dist <= mismatch_dist:
            return bcMatch(True, hamming_dist)

        min_hamming_dist = min(min_hamming_dist, hamming_dist)

    return bcMatch(False, min_hamming_dist)

nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0

for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
    nbr_reads += 1

    bc_match = bc_matches_whitelist(Kmer[16](record.seq), 1)

    if bc_match.matched == True:
        if bc_match.mismatch_dist == 0:
            nbr_bc_0_mismatches += 1
        else:
            nbr_bc_1_mismatches += 1

    if nbr_reads == 10000:
        break

print nbr_reads, nbr_bc_0_mismatches, nbr_bc_1_mismatches
inumanag commented 4 years ago

Hi @ghuls

1) We have optional[T] type which can be either None or some value (that you can access via ~var). The next release will bring a None type more similar to Python's Optional[T]. 2) How many barcodes do you have? Also, how are you compiling this program (what command do you use?)

ghuls commented 4 years ago
  1. Thanks for that info.
  2. I have 737280 barcodes. I run the program like this: seqc find_barcodes.seq

Also how can parallelism be added?

I assume gzipping (when reading a FASTQ file) happens in a separate thread.

Does seq allow running the read_barcode_whitelist_from_file function in parallel, and collect the results later?

Numba for example allows something like this with arrays: https://numba.readthedocs.io/en/stable/user/parallel.html

from numba import njit, prange

@njit(parallel=True)
def prange_test(A):
    s = 0
    # Without "parallel=True" in the jit-decorator
    # the prange statement is equivalent to range
    for i in prange(A.shape[0]):
        s += A[i]
    return s

Another thing I noticed is that if boolean_value is True: is not supported (if boolean_value == True: works fine, so not a big problem).

arshajii commented 4 years ago

Hi,

Maybe a better approach is to flip this procedure; right now you iterate over (reads x barcodes), which is 737280 x 10000 > 7 billion Hamming distance calculations (maybe less by a factor of ~2 since you stop early).

Instead you could construct a set of k-mers from the whitelist and iterate over Hamming neighbors, checking if each is in the set. This would be 10000 x (3x16 + 1) < 500k set lookups.

Here is an example of iterating over Hamming neighbors: https://docs.seq-lang.org/cookbook.html#k-mer-hamming-neighbors

In your case the code might look something like:

whitelist = { ... }  # set of k-mers from whitelist

def bc_matches_whitelist(bc: Kmer[16]):
    if bc in whitelist:
        return 0  # exact match
    for neighbor in neighbors(bc):
        if neighbor in whitelist:
            return 1  # Hamming neighbor in whitelist; distance = 1
    return -1  # no match

Let me know if this makes sense!

P.S. Just if boolean_value: is probably the most idiomatic way to express that condition.

ghuls commented 4 years ago

Thanks a lot.

Now it only takes 45 seconds for the whole FASTQ file (15 milion reads). My original code was still running from this morning on the whole file.

ghuls commented 4 years ago

I got the code parallelized, but I assumed it would run each FASTQ file in a separate thread (+ a thread for gunzipping the FASTQ file), but it seems to work slightly different.

As you can see in the end, setting OMP_NUM_THREADS to different values gives almost the same wall time but user time multiplied (e.g. for 8 and 16). What are those 8 threads doing (as there are no more than 8 fastq files). With OMP_NUM_THREADS=4 the user time is quite comparable with OMP_NUM_THREADS=8 (even a bit shorted), but with OMP_NUM_THREADS=16 it is almost twice as long while the wall time is the same.

Can the number of threads be controlled from seq itself (e.g. depending on the number of fastq files)?

from sys import argv

bc_10x_sc_atac_whitelist_filename = '737K-cratac-v1.txt'

def read_barcode_whitelist_from_file[K](bc_whitelist_filename: str):
    """
    Read whitelisted barcodes from file and convert to a set of Kmers.
    """

    bc_whitelist = set[K]()

    with open(bc_whitelist_filename, 'r') as fh_bc:
        for bc_line in fh_bc:
            bc_str = bc_line.strip()

            if len(bc_str) == 0 or bc_str[0] == '#':
                # Skip empty lines and lines that start with a comment.
                continue

            bc = K(seq(bc_str))
            bc_whitelist.add(bc)

    return bc_whitelist

bc_whitelist = read_barcode_whitelist_from_file[Kmer[16]](bc_10x_sc_atac_whitelist_filename)

def neighbors(kmer):
    """
    Create kmers with hamming distance of 1.
    """

    for i in range(len(kmer)):
        for b in (k'A', k'C', k'G', k'T'):
            if kmer[i] != b:
                yield kmer |> base(i, b)

def bc_matches_whitelist(bc: Kmer[16]) -> int:
    if bc in bc_whitelist:
        # Exact match.
        return 0
    for neighbor in neighbors(bc):
        if neighbor in bc_whitelist:
            # Hamming neighbor in whitelist (distance = 1).
            return 1

    # No match.
    return -1

def find_bc_in_fastq(fastq_filename):

    nbr_reads = 0
    nbr_bc_0_mismatches = 0
    nbr_bc_1_mismatches = 0

    for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
        nbr_reads += 1

        bc_match = bc_matches_whitelist(Kmer[16](record.seq))

        if bc_match == 0:
            nbr_bc_0_mismatches += 1
        elif bc_match == 1:
            nbr_bc_1_mismatches += 1

        #if nbr_reads == 100000:
        #    break

    print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}\nnbr_bc_1_mismatches\t{nbr_bc_1_mismatches}\n'

if len(argv) == 1:
    print 'Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]'
else:
    iter(argv[1:]) ||> find_bc_in_fastq

Running it like this:

time OMP_NUM_THREADS=4 LD_LIBRARY_PATH=/software/seq/lib/seq seqc 10X_scATAC.seq ATAC_*_R2_001.fastq.gz

ATAC_S26_L002_R2_001.fastq.gz:
nbr_reads:      15144411
total_bc_found  672535
nbr_bc_0_mismatches     158421
nbr_bc_1_mismatches     514114

ATAC_S23_L002_R2_001.fastq.gz:
nbr_reads:      15380468
total_bc_found  696047
nbr_bc_0_mismatches     174185
nbr_bc_1_mismatches     521862

ATAC_S23_L001_R2_001.fastq.gz:
nbr_reads:      15443856
total_bc_found  700114
nbr_bc_0_mismatches     175553
nbr_bc_1_mismatches     524561

ATAC_S24_L001_R2_001.fastq.gz:
nbr_reads:      21842054
total_bc_found  953816
nbr_bc_0_mismatches     215077
nbr_bc_1_mismatches     738739

ATAC_S26_L001_R2_001.fastq.gz:
nbr_reads:      15176469
total_bc_found  674257
nbr_bc_0_mismatches     159440
nbr_bc_1_mismatches     514817

ATAC_S25_L001_R2_001.fastq.gz:
nbr_reads:      16592075
total_bc_found  742674
nbr_bc_0_mismatches     181133
nbr_bc_1_mismatches     561541

ATAC_S24_L002_R2_001.fastq.gz:
nbr_reads:      21818614
total_bc_found  950152
nbr_bc_0_mismatches     213626
nbr_bc_1_mismatches     736526

ATAC_S25_L002_R2_001.fastq.gz:
nbr_reads:      16543558
total_bc_found  741736
nbr_bc_0_mismatches     180998
nbr_bc_1_mismatches     560738

# With OMP_NUM_THREADS=4
real    1m52.256s
user    7m28.170s
sys     0m5.939s

# With OMP_NUM_THREADS=6
real    1m35.313s
user    9m16.558s
sys     0m7.383s

# With OMP_NUM_THREADS=8
real    1m9.729s
user    8m49.437s
sys     0m6.365s

# With OMP_NUM_THREADS=16
real    1m9.140s
user    17m6.616s
sys     0m12.671s

# With OMP_NUM_THREADS=36
real    1m15.074s
user    40m47.396s
sys     0m21.305s
arshajii commented 4 years ago

There's no simple way right now to set the number of threads programmatically, although it can technically be done by calling the omp_set_num_threads OpenMP C function:

cimport omp_set_num_threads(i32)
omp_set_num_threads(i32(num_threads))  # num_threads is an int

The code you posted definitely looks reasonable to me, and I'd expect it to be amenable to parallelization. How many cores/threads does the machine you're running on have?

inumanag commented 4 years ago

@ghuls : Thanks for spotting this!

Two things:

ghuls commented 4 years ago

The machine has 36 cores, so if I don't specify OMP_NUM_THREADS, it will run with 36 (and it will be slower than with 8 and use 40 minutes of user time instead of 9 minutes. I assumed seq would figure out it should only use 8 threads.

How would I parallelize bc_matches_whitelist in a useful way?

The I/0 should be pretty fast. The node I am using uses data from GPFS over infiniband.

ghuls commented 4 years ago

How to use define nbr_reads, nbr_bc_0_mismatches and nbr_bc_1_mismatches variable per fastq file (similar to global, but only for a specific function (find_bc_in_fastq))?

I also assume that just defining a variable global is not thread safe (e.g. when modifying it inside process). What is the best way to ensure thread safety for global or per function variables when being written from from multiple threads? @atomic, lock or read lock? Can @atomic only be used on a function or also for just one line? Should @atomic just be used for one variable a the time, or is it ok to use for updating multiple variables?

https://github.com/seq-lang/seq/blob/b025f609d24b691eb12f740eab83c0d4d3b201a7/test/pipeline/parallel.seq

nbr_reads = 0
nbr_bc_0_mismatches = 0
nbr_bc_1_mismatches = 0

def find_bc_in_fastq(fastq_filename: str):
#    nbr_reads = 0
#    nbr_bc_0_mismatches = 0
#    nbr_bc_1_mismatches = 0

    global nbr_reads
    global nbr_bc_0_mismatches
    global nbr_bc_1_mismatches

    #for record in FASTQ(fastq_filename, gzip=True, validate=False, copy=True):
    def process(record: FASTQRecord):

        @atomic
        def update_counters(bc_match: int):
            global nbr_reads
            global nbr_bc_0_mismatches
            global nbr_bc_1_mismatches

            nbr_reads += 1

            if bc_match == 0:
                nbr_bc_0_mismatches += 1
            elif bc_match == 1:
                nbr_bc_1_mismatches += 1

        bc_match = bc_matches_whitelist(Kmer[16](record.seq))

        update_counters(bc_match)

        #if nbr_reads == 100000:
        #    break

    FASTQ(fastq_filename, gzip=True, validate=False, copy=True) |> blocks(size=1000) ||> iter |> process

    print f'{fastq_filename}:\nnbr_reads:\t{nbr_reads}\ntotal_bc_found\t{nbr_bc_0_mismatches + nbr_bc_1_mismatches}\nnbr_bc_0_mismatches\t{nbr_bc_0_mismatches}$

if len(argv) == 1:
    stderr.write('Usage: {argv[0]} fastq_with_bc_file [fastq_with_bc_file ...]\n')
    exit(1)
else:
    iter(argv[1:]) ||> find_bc_in_fastq
inumanag commented 4 years ago
ghuls commented 4 years ago

Thanks.

I just played a bit around with parsing of a BAM file and noticed that the example is lacking some info: https://docs.seq-lang.org/cookbook.html#reading-sam-bam-cram

Also is there a way to get the type from a tag, so the right function can be called on the tag?

Can bam_endpos be exposed in c_htslib (and in bam) as it is very useful to get the genomic end position of a read: https://github.com/samtools/htslib/blob/19c189438f852e6e62dbda73f854d465cebb3d9f/sam.c#L343-L349

$ samtools view sample.bam | head -n 3

NB501171:585:HLM2TBGXG:1:22108:8631:16195       0       chr1    3005850 255     91M     *       0       0       CCAATCCAGCAGTGGCACCTGTGCACAACAACAAACAATAGCTTGACACAATAACCCTAAGGGTATACTAGAGACATGCAGTAACTATGAT     AAA/AAEE6EAEE//E//<E//E/E//E//E</EE/A/EA/<//E///</E/EAE////A///EE<</E/////EE//<<6///<<E///6     NH:i:1  HI:i:1 AS:i:77  nM:i:6  CR:Z:ACGACATTAA_AAGGTGTGGT_TACGACGATG   CY:Z:A//6AA//EA_EE//E//A/E_AE/EEAE<<6   UR:Z:CCCGGGGGCA UY:Z:AAAAAAAAAA
NB501171:585:HLM2TBGXG:3:21410:20944:11818      0       chr1    3005850 255     91M     *       0       0       CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT     AAAAAEEAEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEAEAEEAAEEAEEAA/EEEEEEEEEAAEAE6EE<E<AEAA//E<EE/E/     NH:i:1  HI:i:1 AS:i:89  nM:i:0  CR:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG   CY:Z:AAAAAEEEEE_EEEEEEEEEE_EEEAEE<EEE   UR:Z:CAAGACGCGC UY:Z:AAAAAAAAAA CB:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG   UB:Z:CAAGACGCGC
NB501171:585:HLM2TBGXG:3:22408:2332:9008        0       chr1    3005850 255     91M     *       0       0       CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT     AAAAAEEEEEEEEEEE/EEEEEEEEAE/EEEEEA/EEEEEEE<<E/EEEEEEE/EA<AEEEEAEEEEEEEEEEEE/EEE/<AEAAEE<AEE     NH:i:1  HI:i:1 AS:i:89  nM:i:0  CR:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG   CY:Z:AAAAAEEEEE_EEEE<EEEEE_EEEAEEEEE<   UR:Z:CGCACAGGGC UY:Z:AAAAAAAAAA CB:Z:AGGACATGAA_AAGGTGTGGT_TACGACGATG   UB:Z:CGCACAGGGC
bam_filename = 'sample.bam'

nbr_reads = 0

# Assign BAM() to bam, so we can get the contig later.
bam = BAM(bam_filename)

# Does not work if we want to print the contig name.
# for r in BAM(bam_filename):

for r in bam:
    print r.name
    print r.tid
    print bam.contig(r.tid)
    print r.locus
    print r.pos
    print r.read
    print r.mapq
    print r.cigar
    print r.read1
    print r.read2
    print r.duplicate
    print r.reversed

    print r.aux('CR')
    # Segfault as AS is an integer
    #print r.aux('AS')
    # Works like this.
    print r.aux('AS').i
    # Segfaults if CB tag is not available for the current read.
    print r.aux('CB')

    # Works like this
    cb = r.aux('CB')
    if cb:
        print f"CB {cb.Z}"

    # etc.

    print

    nbr_reads += 1

    if nbr_reads == 3:
        break;
seqc read_bam.seq
NB501171:585:HLM2TBGXG:1:22108:8631:16195
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCAGTGGCACCTGTGCACAACAACAAACAATAGCTTGACACAATAACCCTAAGGGTATACTAGAGACATGCAGTAACTATGAT
255
91M
False
False
False
False
ACGACATTAA_AAGGTGTGGT_TACGACGATG
77

NB501171:585:HLM2TBGXG:3:21410:20944:11818
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT
255
91M
False
False
False
False
AGGACATGAA_AAGGTGTGGT_TACGACGATG
89
CB AGGACATGAA_AAGGTGTGGT_TACGACGATG

NB501171:585:HLM2TBGXG:3:22408:2332:9008
194
chr1
Locus(tid=194, pos=3005849, reversed=False)
3005849
CCAATCCAGCTGTGGCACCTGTGGACCACAACAAAGACTAGCTTGACACAATAACCCTAAGGGTATACTAGTGACATGCAGTAACTATGAT
255
91M
False
False
False
False
AGGACATGAA_AAGGTGTGGT_TACGACGATG
89
CB AGGACATGAA_AAGGTGTGGT_TACGACGATG
arshajii commented 4 years ago

Most of these are quirks of htslib, which we use for parsing BAM/SAM. But,

ghuls commented 4 years ago

Storing the contig name in the SAMRecord is not necessary for me.

I figured out a way to get type info from tags in htslib (see here to see how to convert the one byte type char to the correct type): https://github.com/samtools/htslib/blob/2264113e5df1946210828e45d29c605915bd3733/sam.c#L3648-L3726

Basically get a pointer to the result returned by: bam_aux_get(r.__raw__(), ptr[byte](__ptr__(tag_arr))) and match that byte to get the correct type:

from bio.c_htslib import bam_endpos, bam_aux_get

bam_reader = BAM(bam_filename)

for r in bam_reader:
            # Only use reads with are not marked as duplicate.
            if not r.unmapped and not r.duplicate:
               # Get cell barcode tag.
               cb = r.aux('CB')

               # Only use reads which have cell barcode tag.
               if cb:
                   print "\n" + r.name
                   print(f'{bam_reader.contig(r.tid)}\t{r.pos}\t{int(bam_endpos(bam_reader.__raw__()))}\t{cb.Z}')

                   print 'CB:Z', ptr[byte](cb.s)[0]

                   print 'AS:i', ptr[byte](r.aux('AS').s)[0]

                   print 'HI:i', ptr[byte](r.aux('AS').s)[0]

Output: Z = str, C = u8

NB501171:598:HVLM7BGXG:1:21305:5934:15468
chr2L   5062    5112    TTGTCTGACGACCGTCCGTCTCAACGTCAC
CB:Z Z
AS:i C
HI:i C

NB501171:598:HVLM7BGXG:2:21106:24765:9141
chr2L   5062    5112    AAGTAAGCACTAACCACCGCCGTTGTCAGC
CB:Z Z
AS:i C
HI:i C

NB501171:598:HVLM7BGXG:3:11511:7045:12139
chr2L   5062    5112    TAACACGAAGCTGATATTCATCTTCCAACT
CB:Z Z
AS:i C
HI:i C

Could you also add bam_endpos function from HTSlib as it allows to calculate the most right end location based on the CIGAR string.

diff --git a/stdlib/bio/c_htslib.seq b/stdlib/bio/c_htslib.seq
index ea8b9a4..d8dd9c9 100644
--- a/stdlib/bio/c_htslib.seq
+++ b/stdlib/bio/c_htslib.seq
@@ -17,6 +17,7 @@ from LD cimport bam_read1(cobj, cobj) -> i32
 from LD cimport bam_init1() -> cobj
 from LD cimport bam_cigar2qlen(int, ptr[u32]) -> int
 from LD cimport bam_cigar2rlen(int, ptr[u32]) -> int
+from LD cimport bam_endpos(cobj) -> i32
 from LD cimport bam_aux_get(cobj, ptr[byte]) -> ptr[u8]
 from LD cimport bam_aux2i(ptr[u8]) -> int
 from LD cimport bam_aux2f(ptr[u8]) -> float

Or allow it to be calculated from a SAM record:

hts_pos_t bam_endpos(const bam1_t *b)
{
    hts_pos_t rlen = (b->core.flag & BAM_FUNMAP)? 0 : bam_cigar2rlen(b->core.n_cigar, bam_get_cigar(b));
    if (rlen == 0) rlen = 1;
    return b->core.pos + rlen;
}
arshajii commented 4 years ago

Ah, good to know! This should be very easy to add then.

bam_endpos can also be added -- we'll include these in the next release.

arshajii commented 4 years ago

Just pushed some additions related to your post above:

These will all be included in the next release!

ghuls commented 3 years ago

This comment should have B2f(index) instead of B2i(index): https://github.com/seq-lang/seq/blob/ee3d3d4e615c355d1091e76a49ebfe1a0d6065dd/stdlib/bio/bam.seq#L167

inumanag commented 3 years ago

print function landed in #132. print statement is still supported as print (note the space).

ghuls commented 3 years ago

@inumanag This is not fixed yet:

This comment should have B2f(index) instead of B2i(index):

https://github.com/seq-lang/seq/blob/ee3d3d4e615c355d1091e76a49ebfe1a0d6065dd/stdlib/bio/bam.seq#L167

inumanag commented 3 years ago

Fixed now on #132 . Thanks.