sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Created an inDrops V3 whitelist generator function with user-defined library indexes #171

Open mariusmessemaker opened 4 years ago

mariusmessemaker commented 4 years ago

I wrote a Python function that generates an inDrops V3 whitelist from the gel_barcode2_list.txt file (mostly adapted from the indrops.py code): https://github.com/indrops/indrops/blob/master/ref/barcode_lists/gel_barcode2_list.txt. I thought some zUMIs users might want to use it to generate their own whitelists to supply to zUMIs. The function arguments are:

indexes: list of strings that contain the library adapter sequences that were used to generate the libraries (e.g. ['AGAGGATA', 'TACTCCTT']).

name: string that contains the file name (.txt) under which the whitelist is saved.

indexlength: int that specifies the number of bases in your library index (to make sure you don't make manual copy errors).

numberOflibraries: int that specifies the number of libraries that were generated (i.e. the number of different library indexes that were used; also to make sure you don't make manual copy errors).

The function outputs the Cartesian product of all R2 BC1s, R3 library indexes, and R4 BC2s as concatenated strings, in the order:

R2: 'AAACAAAC'
R3: 'ATAGAGAG'
R4: 'GTTTGTTT'

Concatenated string: 'AAACAAACATAGAGAGGTTTGTTT'
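To make the concatenation order concrete, here is a minimal sketch of how a single whitelist entry is assembled. It assumes, as in the function below, that the R4 part is the reverse complement of an entry from the barcode list; `build_entry` is a hypothetical helper name used only for illustration.

```python
# Complement lookup, including N for ambiguous bases.
COMPLEMENT = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C', 'N': 'N'}

def rev_comp(seq):
    """Return the reverse complement of a DNA sequence."""
    return ''.join(COMPLEMENT[base] for base in reversed(seq))

def build_entry(bc1, library_index, bc2):
    """Concatenate BC1 (R2), library index (R3) and reverse-complemented BC2 (R4).

    Hypothetical helper for illustration only.
    """
    return bc1 + library_index + rev_comp(bc2)

entry = build_entry('AAACAAAC', 'ATAGAGAG', 'AAACAAAC')
print(entry)  # AAACAAACATAGAGAGGTTTGTTT
```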

The function:

def generateIndropV3Whitelist(indexes, name, indexlength, numberOflibraries):
    # Code mostly adapted from Adrian Veres author of indrops.py: https://github.com/indrops/indrops
    # Assuming zUMIs file input order: R2, R3, R4 

    from itertools import product

    ___tbl = {'A':'T', 'T':'A', 'C':'G', 'G':'C', 'N':'N'}
    def rev_comp(seq):
        return ''.join(___tbl[s] for s in seq[::-1])

    def checkIfDuplicates(listOfElems):
        setOfElems = set()
        for elem in listOfElems:
            if elem in setOfElems:
                return True
            else:
                setOfElems.add(elem)
        return False

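    # Check that you copied the indexes completely
    for index in indexes:
        if len(index) != indexlength:
            raise ValueError('Index {!r} does not have length {}'.format(index, indexlength))

    # Check that you didn't copy the indexes with duplicates
    if checkIfDuplicates(indexes):
        raise ValueError('Duplicate library indexes supplied')

    # Check if you supplied the correct number of non-duplicate indexes
    if len(indexes) != numberOflibraries:
        raise ValueError('Expected {} indexes, got {}'.format(numberOflibraries, len(indexes)))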

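    # gel_barcode2_list.txt must be in the current working directory
    with open('gel_barcode2_list.txt') as f:
        # Skip blank lines so they don't produce empty barcodes
        bc2s = [line.strip() for line in f if line.strip()]
    rev_bc2s = [rev_comp(bc2) for bc2 in bc2s]

    # BC1 (R2) + library index (R3) + reverse-complemented BC2 (R4)
    barcode_iter = product(bc2s, indexes, rev_bc2s)

    with open(name, 'w') as f:
        for barcode in barcode_iter:
            f.write(''.join(barcode) + '\n')
    return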

For example, you can use the function as follows:

#  Sample 1,  library_index: "ATAGAGAG"
#  Sample 2, library_index: "AGAGGATA"
#  Sample 3, library_index: "TACTCCTT"
#  Sample 4, library_index: "AGGCTTAG"
#  Sample 5, library_index: "CTAGTCGA"
#  Sample 6, library_index: "AGCTAGAA"
#  Sample 7, library_index: "CTTAATAG" 
#  Sample 8, library_index: "ATAGCCTT"

indexes = ['ATAGAGAG', 'AGAGGATA', 'TACTCCTT', 'AGGCTTAG', 'CTAGTCGA', 'AGCTAGAA', 'CTTAATAG', 'ATAGCCTT']
generateIndropV3Whitelist(indexes, './whitelists/pool1_whitelist.txt', 8, 8)

This function call outputs 384 BC1s x 8 library indexes x 384 BC2s = 1,179,648 concatenated BC strings. You can also supply an empty library index string, which yields the Cartesian product of the R2 BC1s and R4 BC2s:

indexes = ['']
generateIndropV3Whitelist(indexes, './whitelists/bc1andbc2.txt', 0, 1)
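If you want to confirm that a generated file has the expected size, a quick sanity check along these lines could help. This is only a sketch: `check_whitelist` and its parameters are hypothetical names, and it assumes one barcode per line with all entries of equal length.

```python
def check_whitelist(path, n_barcodes, n_indexes, barcode_length):
    """Return True if the file holds the expected number of unique,
    equal-length entries (n_barcodes x n_indexes x n_barcodes).

    Hypothetical helper; assumes one concatenated barcode per line.
    """
    with open(path) as f:
        entries = [line.rstrip('\n') for line in f]
    expected = n_barcodes * n_indexes * n_barcodes
    return (len(entries) == expected
            and len(set(entries)) == expected
            and all(len(entry) == barcode_length for entry in entries))
```

For the first example above, `check_whitelist('./whitelists/pool1_whitelist.txt', 384, 8, 24)` should return True if the barcode list has 384 entries of 8 bases each.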

Qotov commented 4 years ago

@mariusmessemaker, could you please tell us whether you use this generated whitelist or let zUMIs generate one automatically, and whether that affects the final results?

Thanks

mlizio commented 2 years ago

Hi, I have a question related to barcode generation, but for SPLiT-seq:

I am analysing experiments with four barcodes. If I choose to provide a whitelist to zUMIs, should I concatenate the barcodes in the 3' -> 5' direction, or R1 first, then R2?

The experimental design is such that reads are barcoded as follows: R1 has UMI+BC1+cDNA, R2 has BC2+GGG+cDNA, and additionally I have two more index files with another barcode each. All barcodes are 8nt, so my cells will be identified by a 32nt barcode.

Does it matter how I concatenate the barcodes? And do I need to create a barcode list with all possible combinations anyway?

thanks

cziegenhain commented 2 years ago

Hi,

If you choose to provide a whitelist, you would concatenate the barcode pieces in the order in which you provide the barcode ranges in the YAML file. So if your file1 is R1, start with that barcode, and so on. For reference: https://github.com/sdparekh/zUMIs/wiki/Barcodes#barcode-annotation
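As a rough sketch of that ordering rule, assuming you do enumerate all combinations: pass the per-file barcode lists in the same order as the files in the YAML and join each combination. `combine_barcodes` and the toy 2 nt barcodes are placeholders for illustration, not real barcode sets, and any reverse-complementing needed for your reads is not handled here.

```python
from itertools import product

def combine_barcodes(barcode_lists):
    """Yield concatenated barcodes.

    barcode_lists must be ordered like the barcode ranges in the YAML
    (file1 first, then file2, ...). Hypothetical helper.
    """
    for parts in product(*barcode_lists):
        yield ''.join(parts)

# Toy 2 nt barcodes standing in for the real 8 nt lists (placeholders).
file1_bcs = ['AA', 'CC']
file2_bcs = ['GG', 'TT']
whitelist = list(combine_barcodes([file1_bcs, file2_bcs]))
print(whitelist)  # ['AAGG', 'AATT', 'CCGG', 'CCTT']
```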

Best, Christoph

mlizio commented 2 years ago

Hi, thank you Christoph. It makes sense now, that being the way zUMIs reads in the files!!

Cheers