walaj / bxtools

Tools for analyzing 10X Genomics data
MIT License

understanding bxtools stats output #2

Closed danshu closed 7 years ago

danshu commented 7 years ago

Thanks for your useful tool! I have generated stats for my 10x Genomics data with "bxtools stats $bam > stats.tsv". The output columns are: BX count median_isize median_mapq

If I understand it correctly, BX is the barcode and count is the number of reads for that barcode/pool/droplet. In my dataset there are many BXs with very low counts, so I want to filter out these low-frequency BXs, but I am not sure how to go about it.

In your example:

"make a list of bad tags (freq < 100)"

Is "freq < 100" a general standard for filtering bad BXs? Also, since the input to bxtools stats was an unaligned BAM file containing paired-end reads, does freq=100 in field 2 correspond to 50 read pairs for a BX?

I have also tried to find out from the literature how to set this filtering threshold. In the paper "A hybrid approach for de novo human genome sequence assembly and phasing", I found the following sentence: "those barcodes that were seen below a given threshold frequency (22 for library 1 and 101 for library 2, based on the lowest frequency among the number of barcodes that were detected in these libraries by 10XG’s Long Ranger software)". I cannot understand what "the lowest frequency among the number of barcodes that were detected in these libraries by 10XG’s Long Ranger software" means. If I filter using the lowest frequency, nothing is filtered at all, right?

Sorry if my question is a little unrelated to your tool.

Best, Danshu

walaj commented 7 years ago

Hi Danshu, I think the best I could advise would be to make a histogram of the BX counts and find a reasonable cutoff for your own data. bxtools was a sort of weekend-type project to help with some 10X analyses led by some collaborators, so I wouldn't presume at this point to give much advice on interpreting the outputs. For example, the value of "100" was completely ad hoc. I'm not really sure what is meant by their filter either, so you might reach out to the authors.
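A minimal sketch of the histogram idea, not part of bxtools. It assumes the stats output is tab-separated with the barcode in field 1 and the count in field 2 (as described above), and that any row whose second field is not an integer (such as a header) can be skipped. The function names, the bin width, the cutoff of 100, and the example rows are all made up for illustration.

```python
import io
from collections import Counter


def read_bx_counts(handle):
    """Parse BX -> count from tab-separated rows; skip rows without an integer count."""
    counts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # header line or malformed row (assumption: safe to ignore)
        counts[fields[0]] = int(fields[1])
    return counts


def count_histogram(counts, bin_width=100):
    """Return {bin_start: number of barcodes whose count falls in that bin}."""
    hist = Counter((c // bin_width) * bin_width for c in counts.values())
    return dict(sorted(hist.items()))


def low_count_barcodes(counts, cutoff):
    """Return the sorted barcodes whose count is strictly below the cutoff."""
    return sorted(bx for bx, c in counts.items() if c < cutoff)


# Made-up rows in the assumed layout: BX, count, median_isize, median_mapq.
example = io.StringIO(
    "BX\tcount\tmedian_isize\tmedian_mapq\n"
    "AAACCCAAGT-1\t3\t0\t0\n"
    "AAACCCAGTC-1\t7\t0\t0\n"
    "AAACGGGTCA-1\t150\t0\t0\n"
    "AAAGTAGCAT-1\t212\t0\t0\n"
    "AAATGCCGTT-1\t12\t0\t0\n"
)
counts = read_bx_counts(example)
hist = count_histogram(counts, bin_width=100)
bad_tags = low_count_barcodes(counts, cutoff=100)  # 100 is arbitrary here
for bin_start, n_barcodes in hist.items():
    print(f"{bin_start}\t{n_barcodes}")
```

For real data, replace the in-memory example with `open("stats.tsv")` and inspect the histogram before settling on a cutoff; the shape of the distribution, not a fixed number, is what would suggest a threshold.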

That said, I'd like to learn more sometime about 10X data. In the meantime, if you have any suggestions for the types of low-level data manipulation (like stats or split) that would be useful from bxtools, do let me know. I'd be interested in expanding this!

Best, Jeremiah

danshu commented 7 years ago

Hi Jeremiah,

Thanks for your kind advice. I have tried plotting the BX counts to find a reasonable threshold. But what should really matter are the characteristics of the technology, e.g. whether a low BX count indicates poor quality of that pool, which I'm not sure about. Anyway, I will try asking the company for more details.

I'm trying to use my 10x Genomics data to scaffold my assembly. One common requirement of these scaffolders is that the BX be attached to the read name, although different tools expect different formats. For example, ARCS (https://github.com/bcgsc/arcs) expects the barcode at the end of the read name, in this format: READNAME_TAGCATAGACATCAGA. It would be great if bxtools could combine filtering with attaching BXs to read names.

I managed to do this by extracting the filtered reads with "seqtk subseq" and then renaming them to the required format: "awk '{ if (NR%4==1) {split($2,a,":"); split(a[3],b,"-"); print $1"_"b[1] } else { print } }' test.fastq > rename_test.fastq".
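For readers who prefer not to use awk, the same renaming step can be sketched in Python. This assumes what the awk command assumes: four-line FASTQ records whose header has the barcode in the second whitespace-separated field in a form like BX:Z:TAGCATAGACATCAGA-1. The function names and the example read are hypothetical.

```python
def relabel_header(header):
    """Turn '@NAME BX:Z:BARCODE-1' into '@NAME_BARCODE'.

    Headers without a recognisable barcode field are returned unchanged
    (the awk one-liner would instead emit the name with a trailing '_').
    """
    fields = header.split()
    if len(fields) < 2:
        return header
    parts = fields[1].split(":")  # e.g. ['BX', 'Z', 'TAGCATAGACATCAGA-1']
    if len(parts) < 3:
        return header
    barcode = parts[2].split("-")[0]  # drop the '-1' style suffix
    return f"{fields[0]}_{barcode}"


def relabel_fastq(lines):
    """Yield FASTQ lines with every header (1st line of each 4-line record) relabeled."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        yield relabel_header(line) if index % 4 == 0 else line


# Made-up single-record example.
record = [
    "@read1 BX:Z:TAGCATAGACATCAGA-1\n",
    "ACGTACGT\n",
    "+\n",
    "IIIIIIII\n",
]
renamed = list(relabel_fastq(record))
print("\n".join(renamed))
```

To process a file, pass an open handle instead of the list, for example `relabel_fastq(open("test.fastq"))`, and write the yielded lines to the output file.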

Best, Danshu

walaj commented 7 years ago

I'm in a mood to procrastinate on other things at the moment, so I decided to give this a shot... I just added relabel. Does this do what you need? You can pipe the output of relabel to samtools view and AWK to get a FASTQ from a BAM if you need to.

danshu commented 7 years ago

Thanks and I will try it later!