dropSeqPipe configuration

grst commented 5 years ago

Hi @Hoohm,

while creating the samples.csv files, I came up with a couple of questions:

1. Is the n_cells parameter in samples.csv mandatory? I plan to filter downstream with scanpy, and I cannot find the 'expected cells' per library in the description of every dataset.
2. Read length only refers to R1 read length, right?
3. Some samples consist of multiple runs with different read lengths each (:roll_eyes:). How should I deal with them? Suggestion: treat them as individual samples.
4. Which brings me to the next question: does it make any difference if I (a) concatenate two fastq files and treat them as a single sample or (b) leave them as they are and treat them as two samples?

Hoohm commented 5 years ago

1. Yes, you have to provide this. It is used as the number of cells to extract with umi-tools. If you have no idea at all, we should use a very high number. Hopefully, with the sparse matrix, it should not pose a big problem.
2. It is the read length of R2, the RNA read. This is used for STAR index generation.
3. Exactly the same samples with different read lengths? That is odd. I would keep them apart and combine them at the very end based on the cell barcode.
4. You can concatenate different lanes together; this is usually the case with 10x data. However, I would not concatenate samples that have different read lengths, because trimming and error rates are going to be biased.
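As a rough illustration of the lane concatenation mentioned above: gzip-compressed files can be appended to each other byte for byte, and the result is still a valid gzip stream. The sketch below assumes all lanes belong to the same sample and share the same read length; the function names `concatenate_lanes` and `count_reads` are hypothetical and not part of dropSeqPipe.

```python
import gzip
import shutil


def concatenate_lanes(lane_files, merged_file):
    """Append gzip-compressed FASTQ lanes into one file, byte for byte.

    Assumption: every lane comes from the same sample and has the same
    read length; lanes with different read lengths should stay separate.
    """
    with open(merged_file, "wb") as out:
        for lane in lane_files:
            with open(lane, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(fastq_gz):
    """Count FASTQ records (four lines each) in a gzip-compressed file."""
    with gzip.open(fastq_gz, "rt") as handle:
        return sum(1 for _ in handle) // 4
```

Comparing `count_reads` on the merged file with the sum over the individual lanes is a cheap sanity check that nothing was lost or truncated.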

Hoohm commented 5 years ago

For the first question, 10x provides a whitelist of all possible cell barcodes. We can use those if you want, meaning we extract all 737k cells and then filter.
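To make the "extract everything, then filter" idea concrete, here is a minimal sketch. It assumes a plain-text whitelist with one barcode per line and a dictionary of UMI counts per barcode; the names `load_whitelist`, `filter_by_whitelist`, and the `min_umis` threshold are hypothetical and only illustrate the principle, not how the pipeline does it internally.

```python
def load_whitelist(path):
    """Read a whitelist file with one cell barcode per line into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_by_whitelist(umis_per_barcode, whitelist, min_umis=1):
    """Keep barcodes that are on the whitelist and have enough UMIs.

    umis_per_barcode: dict mapping barcode -> number of UMIs observed.
    min_umis: hypothetical downstream threshold, applied after extraction.
    """
    return {
        barcode: n_umis
        for barcode, n_umis in umis_per_barcode.items()
        if barcode in whitelist and n_umis >= min_umis
    }
```

For example, `filter_by_whitelist({"AAAC": 120, "GGGT": 3, "NNNN": 50}, {"AAAC", "GGGT"}, min_umis=10)` keeps only `AAAC`: `NNNN` is not on the whitelist and `GGGT` falls below the threshold.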

grst commented 5 years ago

> For the first question, 10x provides a whitelist of all possible cell barcodes. We can use those if you want, meaning we extract all 737k cells and then filter.

That's basically what I do now for the Lambrechts data from cellranger. It seems to work quite well. But what would that look like for the other protocols? Or would it work if I just specify a large, arbitrary number, say 200,000?

> Exactly the same samples with different read lengths? That is odd. I would keep them apart and combine them at the very end based on the cell barcode.

They have the same GSM identifier on GEO, but multiple SRR identifiers on SRA. I think they contain different cells. -> Keeping them separate definitely makes sense.

grst commented 5 years ago

They actually have different read lengths within the same file. No idea how they got there. It's all Illumina HiSeq 2500. [Screenshot: output of zless SRRXXXXXX_2.fastq.gz]
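Instead of eyeballing the file with zless, the mix of read lengths can be tabulated. This is a minimal sketch assuming a standard four-line-per-record, gzip-compressed FASTQ file; `read_length_histogram` and its `max_reads` parameter are hypothetical names chosen for illustration.

```python
import gzip
from collections import Counter


def read_length_histogram(fastq_gz, max_reads=None):
    """Count how often each read length occurs in a gzipped FASTQ file.

    max_reads: optionally stop after this many records, which is usually
    enough to see whether the file mixes several read lengths.
    """
    lengths = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        for i, line in enumerate(handle):
            record, position = divmod(i, 4)
            if max_reads is not None and record >= max_reads:
                break
            if position == 1:  # second line of each record is the sequence
                lengths[len(line.rstrip("\r\n"))] += 1
    return lengths
```

Calling `read_length_histogram("SRRXXXXXX_2.fastq.gz", max_reads=100000)` returns a `Counter` mapping read length to number of reads; more than one key means the file really does contain reads of different lengths.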

What will STAR do when it does not have an index for a certain read length? Actually, the authors use STAR, too, but I couldn't find out so far how they generate the index.

grst / single_cell_data_integration

dropSeqPipe configuration #11