pachterlab / splitcode

Flexible and efficient parsing, interpreting and editing of sequencing reads
https://pachterlab.github.io/splitcode/
BSD 2-Clause "Simplified" License
35 stars 0 forks source link

Documentation #1

Closed timoast closed 7 months ago

timoast commented 1 year ago

Hi @Yenaled

We met at biology of genomes this year and discussed this package. I'm interested in using it for a pipeline I'm working on, I know you're still actively developing this but I'm just wondering if it is in a "useable" state yet, and if so do you have any basic documentation available to help get started? What I'd like to do in my pipeline is extract two barcodes from the index read, do some barcode correction to a whitelist, then attach the barcodes to the read names in another fastq file.

Yenaled commented 1 year ago

@timoast

Hi Tim, the package is "useable but not documented" currently. My plan is to write the documentation later this month (and to hopefully put out a preprint sometime next month). That said, I am happy to show how to use this package for specific use cases. The package is pretty far along enough so that future updates to it should be fully backwards-compatible with the current version.

I'll post a simple example here for your use case in a day or two. Stay tuned!

timoast commented 1 year ago

Awesome! Thanks @Yenaled, appreciate it

Yenaled commented 1 year ago

@timoast

Apologies for the delay. I'll give an example for your use case here:

Say you have 4 files: I1.fq, I2.fq, R1.fq, R2.fq

Create a config file (config.txt) as follows (here, just showing a whitelist of two barcodes for I1 and three barcodes for I2):

groups  tags    distances   locations
bc1 AAGC    1   0:0:4
bc1 TTAA    1   0:0:4
bc2 GCCTT   1   1:0:5
bc2 CGACG   1   1:0:5
bc2 AAGCT   1   1:0:5

Basically, the "tags" column contains the barcode sequences, the "distances" column contains the Hamming distance mismatch allowance (here allowing one mismatch), the "locations" column tells the program the window within which to search for the barcode (e.g. 0:0:4 = file 0, barcode is within the 4-bp window of position 0-4; 1:0:5 = file 1, start position 0, barcode is within the 5-bp window of position 0-5). The "groups" contains what you name the groups that the sequences fall into (here, bc1 and bc2 for barcode 1 and barcode 2 respectively).

Then you can run:

splitcode --config=config.txt --seq-names --nFastqs=4 --trim-only --output=out1.fq,out2.fq --keep-grp=<(echo "bc1,bc2") --select=2,3 I1.fq I2.fq R1.fq R2.fq

Briefly, here's a description of these options: --seq-names = Put the barcodes (from config.txt) extracted from reads into the FASTQ name comment section in the output files. It'll have the SAM tag CB:Z: so, for example, a FASTQ read identifier line in the output file might look like "@read1 CB:Z: AAGCGCCTT" --nFastqs = Indicates that 4 files constitute a single "read" (because we have I1, I2, R1, R2) --trim-only = This requires going into a little detail. Basically, in the default mode, splitcode will take the extracted barcodes and give it a unique identifier. If that same list of barcodes is encountered again, it'll be mapped into to that same identifier. This is useful for many applications (e.g. split-pool sequencing where barcode structure is complex). However, in --trim-only mode, we skip all that. --output = Specify output files (separated by commas). Here, we specify two output files (R1 and R2 outputs). You can add .gz to the files if you want to compress them. We list two output files because we want to output two FASTQ files (see the --select option). --keep-grp = Technically you supply a file that contains a list of what "groups" you want to keep (here, we say we only keep a read if a barcode in the bc1 group followed by a barcode in the bc2 group is found in the read; if we don't see that, we ignore the read) --select = Which fastq files should we output (here, we output file 2 and file 3 which are R1.fq and R2.fq; we don't output file 0 and file 1 which are the index files).

To install, simply do:

git https://github.com/Yenaled/splitcode.git
cd splitcode
mkdir build
cd build
cmake ..
make
make install

Hope this helps! I apologize if there's anything that's unclear and am happy to promptly respond to questions! As this has not been published/peer-reviewed yet, I can't guarantee that the features will be bug free but if you encounter anything strange, please let me know asap and I'll fix it asap.

timoast commented 1 year ago

Thanks @Yenaled! I'll give it a try and let you know how it goes.

timoast commented 1 year ago

Is there any way to specify the barcode region in the read, but have it just match all sequences in that region (like setting "tags" to be a wildcard)? Or match to anything in a large whitelist file?

My exact use-case is I have reads where there's a Tn5 barcode in one position in the index read (which tells what type of assay), and another part of the same read has the 10x Genomics cell barcode. What I'd like to do is:

  1. Match the Tn5 barcode sequence to my list (which is small, <10),
  2. Extract both the Tn5 and 10x barcode sequences
  3. Attach each sequence to the FASTQ comment field with different read tags, for example: CB:Z:<10x barcode sequence> TB:Z:<Tn5 sequence>
timoast commented 1 year ago

From looking at the command line help, maybe I don't include the cell barcode in the config file but add the --extract parameter something like --extract={bc2}14<cb[16]> if the cell barcode is 14 bp after bc2, and 16 bp in length?

Yenaled commented 1 year ago

OK, I understand now. I'll need to make a minor addition to splitcode. I need to give it an option to specify different read tags (currently CB:Z: is the only one). And yes, you can extract the 16-bp 10x barcode sequence that way and it'll be outputted into a file called cb.fq along with your other output files (but I still need to provide an option to get the Tn5 barcode into the TB:Z: tag). I'll do it this week so stay tuned. I'll update you and provide an example when I do.

timoast commented 1 year ago

Thanks! I managed to get it working, I ran with:

    --extract={{bc2}}14\<cb[16]\> \ 
    --x-names \

and it's put the cell barcode in the read1/read2 comment field along under "RX:Z" tag and the Tn5 barcodes under "CB:Z" tag. If you're able to add the ability to specify the tag names that would be awesome, but this gets the job done!

Yenaled commented 1 year ago

@timoast

Alright, I just pushed a commit that allows overriding the default SAM tags.

The default SAM tags used in the program are specified as:

--sam-tags=CB:Z,RX:Z:,BI:i:

Simply change it anything else you'd like :)

And sorry, I forgot to mention previously that 'group' names should be enclosed in two curly braces, e.g. {{bc2}} (but it seems you figured that out). The 'barcode' names are enclosed in single curly braces, e.g. {GCCTT} (by default, the barcode names are the sequences).

I should mention that while --seq-names will contain the original sequences you listed in your config.txt file, the 10X barcodes you extract via --extract doesn't do any error correction (it just extracts the raw sequences from the reads). To error-correct to a whitelist, this program currently doesn't support that functionality. I do plan on putting in some basic error correction though (it's the next thing I'm working on).

timoast commented 1 year ago

It looks like there's a compile error with the new update:

make[1]: Entering directory '/home/stuartt/github/splitcode/build'
make[2]: Entering directory '/home/stuartt/github/splitcode/build'
Scanning dependencies of target splitcode_core
make[2]: Leaving directory '/home/stuartt/github/splitcode/build'
make[2]: Entering directory '/home/stuartt/github/splitcode/build'
[ 20%] Building CXX object src/CMakeFiles/splitcode_core.dir/ProcessReads.cpp.o
[ 40%] Building CXX object src/CMakeFiles/splitcode_core.dir/main.cpp.o
[ 60%] Linking CXX static library libsplitcode_core.a
make[2]: Leaving directory '/home/stuartt/github/splitcode/build'
[ 60%] Built target splitcode_core
make[2]: Entering directory '/home/stuartt/github/splitcode/build'
Scanning dependencies of target splitcode
make[2]: Leaving directory '/home/stuartt/github/splitcode/build'
make[2]: Entering directory '/home/stuartt/github/splitcode/build'
[ 80%] Building CXX object src/CMakeFiles/splitcode.dir/main.cpp.o
[100%] Linking CXX executable splitcode
/usr/bin/ld: CMakeFiles/splitcode.dir/main.cpp.o: in function `usage()':
main.cpp:(.text+0x19df): undefined reference to `ProgramOptions::sam_tags_default'
collect2: error: ld returned 1 exit status
make[2]: *** [src/CMakeFiles/splitcode.dir/build.make:87: src/splitcode] Error 1
make[2]: Leaving directory '/home/stuartt/github/splitcode/build'
make[1]: *** [CMakeFiles/Makefile2:96: src/CMakeFiles/splitcode.dir/all] Error 2
make[1]: Leaving directory '/home/stuartt/github/splitcode/build'
make: *** [Makefile:130: all] Error 2
Yenaled commented 1 year ago

Very strange -- it compiles fine on my end. Nonetheless I modified the potentially problematic code and pushed a new commit. It should work on your end now.

timoast commented 1 year ago

Thanks, that fixed it, and the new --sam-tags parameter works nicely!

Yenaled commented 7 months ago

Closing this issue since documentation is complete - https://splitcode.readthedocs.io/en/latest/index.html and a description is preprinted - https://doi.org/10.1101/2023.03.20.533521

Hope you find the software helpful for manipulating+preprocessing complex FASTQ files whenever the need arises!

Congrats on starting your lab earlier this year btw!

timoast commented 7 months ago

Thanks @Yenaled!