Problem: different input file results in incorrect parsing

micheleolivieri commented 6 years ago

The input file is supposed to be like this:
row1: GENE_CLONE GENE Sample1 Sample2
row2: A1BG_CACCTTCGAGCTGCTGCGCG A1BG 94 713

But screen data are now analyzed with TKOv2, whose readcounts files contain the genomic coordinates instead of the sgRNA sequences, like this:
row2: chr1:172628558-172628577FASLG+ FASLG 85 165

This causes incorrect parsing (see picture below) when uploading a sample, and it is probably why the upload results in an error.

Considerations: Genomic coordinates contain important info for further analyses, but I don't know which reference genome they refer to (GRCh37 or GRCh38), so they could be a potential source of error when comparing different libraries. Also, it would be nice to have the corresponding guide sequence for each genomic coordinate, to analyze whether a specific sgRNA works well in different cell lines (useful for follow-up experiments).

knightjdr commented 6 years ago

So different libraries will need to be parsed in different ways to extract the guide information. I can try to autodetect the correct way to parse, but I will also add an option on the sample input page that lets you switch between parsing options to ensure the column is being parsed correctly.
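A minimal sketch of what such autodetection with a manual override could look like. The regular expressions are inferred only from the two example rows quoted in this thread, the format names `coordinates` and `sequence` are hypothetical, and none of this is the actual screenhits code.

```python
import re

# Patterns inferred from the two example rows in this thread (assumption);
# real readcounts files may need looser or stricter rules.
COORD_RE = re.compile(r"^chr[0-9A-Za-z]+:\d+-\d+_?[A-Za-z0-9.\-]+?_?[+-]$")
SEQ_RE = re.compile(r"^[A-Za-z0-9.\-]+_[ACGT]+$")


def detect_guide_format(values):
    """Return 'coordinates', 'sequence', or None if the column is ambiguous."""
    values = list(values)
    if not values:
        return None
    if all(COORD_RE.match(v) for v in values):
        return "coordinates"
    if all(SEQ_RE.match(v) for v in values):
        return "sequence"
    return None


def choose_guide_format(values, user_choice=None):
    """Prefer the user's explicit choice; otherwise fall back to detection."""
    if user_choice is not None:
        return user_choice
    detected = detect_guide_format(values)
    if detected is None:
        raise ValueError("could not detect the guide column format")
    return detected
```

Requiring every value in the column to match one pattern keeps a mixed or malformed column from being silently parsed the wrong way; in that case the user has to pick the format explicitly.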

The bigger issue is what should be stored with the sample. Should it be the guide sequence, the chromosomal region, or both? We need to be consistent, because it is a problem if people refer to guides in different ways. I suppose recording both the guide sequence and the chromosomal region would be best. It looks like the TKO library files can be used for mapping, so I will implement saving both the guide sequence and the chromosomal region based on the TKO files.

If people use a custom library, then they would need to upload these mapping files, so I will add documentation on how to format them.
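As a rough illustration of the mapping idea, the sketch below reads a mapping file into a lookup table and attaches both identifiers to a read count. The file layout (tab-delimited with `region` and `sequence` header columns) is purely an assumption for this example, not the TKO library format or the format screenhits expects, and the values in the usage lines are placeholders rather than real guides.

```python
import csv
import io


def load_guide_map(handle):
    """Read a tab-delimited mapping file into {region: sequence}.

    Assumes hypothetical header columns named 'region' and 'sequence'.
    """
    guide_map = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        region = row["region"].strip()
        sequence = row["sequence"].strip().upper()
        # Refuse files that map one region to two different guides.
        if guide_map.get(region, sequence) != sequence:
            raise ValueError(f"conflicting sequences for {region}")
        guide_map[region] = sequence
    return guide_map


def annotate(region, guide_map):
    """Return both identifiers; 'sequence' is None when the region is unmapped."""
    return {"region": region, "sequence": guide_map.get(region)}


# Placeholder data, not taken from any real library file.
example_file = io.StringIO("region\tsequence\nchr1:100-119\tacgtacgtacgtacgtacgt\n")
example_map = load_guide_map(example_file)
example_record = annotate("chr1:100-119", example_map)
```

Returning `None` for unmapped regions, instead of raising, lets the upload step report all guides missing from the library at once.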

micheleolivieri commented 6 years ago

I am not sure how different libraries generate different output readcounts files. I know that all the screens done in our lab with the TKOv2 and TKOv3 libraries always generate a readcounts file with the genomic coordinates instead of the sgRNA sequences. That said, the sgRNA sequences are the most important thing to have, in my opinion. They are the most useful info because they allow checking whether the sgRNA is actually present in the genome of the cell line we are screening (maybe a gene was not a hit because of a single nucleotide polymorphism?), and also comparing different libraries that share the same guides. The genomic coordinates are less important and could be a potential source of error (genomic coordinates depend on the reference genome, so the same coordinates in different libraries could, although it is maybe improbable, indicate different regions). Anyway, having the genomic coordinates would be interesting for further analysis; for example, maybe the hits for a particular condition/drug/treatment/cell line are genes clustering on the same chromosome. I personally never had to use them, but it would be an interesting bonus feature.

So, in conclusion, I would say that the most important thing is the unique sgRNA sequence. I guess the options now are either to redo all the analyses so that we have the correct first column (GENE_CLONE) or, if there is a way to convert all the genomic coordinates into guide sequences using a mapping file as you suggested, to do that instead, which could be faster. I don't know which is easier to implement, though. Maybe the user could decide, for the first column, whether s/he wants to apply parsing (and choose the specific part to keep) or use a mapping file. Unfortunately, the way you designed the uploading tool was the best one: very easy to understand... this will complicate things...

knightjdr commented 6 years ago

I have updated CRISPR screens to support two input formats for the guide column: either chrgene+/- or gene_guide. The default will be chrgene+/-, since that is the main format you use, but users can change the parsing option on the sample input form via a dropdown. After the sample is uploaded, the TKO libraries are used for mapping, so that each read count has both the guide sequence and the chromosome region.
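To make the two formats concrete, here is a hedged sketch of parsing the guide column under a selected format, with the coordinate format as the default. The function name, the format labels, and the regular expression are illustrative assumptions based on the example rows earlier in this thread; the actual screenhits implementation is not shown here and may differ.

```python
import re

# Coordinates, then gene, then strand; separators are optional because the
# example row in this thread shows none (assumption about the real files).
COORD_RE = re.compile(r"^(chr[0-9A-Za-z]+:\d+-\d+)_?([A-Za-z0-9.\-]+?)_?([+-])$")


def parse_guide_column(value, fmt="coordinates"):
    """Split one guide-column value into its parts for the chosen format."""
    if fmt == "coordinates":
        match = COORD_RE.match(value)
        if match is None:
            raise ValueError(f"not a coordinate-style guide: {value!r}")
        region, gene, strand = match.groups()
        return {"gene": gene, "region": region, "strand": strand, "sequence": None}
    if fmt == "sequence":
        # Split on the last underscore so the sequence is always the final part.
        gene, sep, sequence = value.rpartition("_")
        if not sep or not gene or not sequence:
            raise ValueError(f"not a gene_guide value: {value!r}")
        return {"gene": gene, "region": None, "strand": None, "sequence": sequence}
    raise ValueError(f"unknown format: {fmt!r}")
```

Whichever field comes back as `None` is the one the library mapping step would then fill in, so that every read count ends up with both the guide sequence and the region.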

knightjdr / screenhits

Problem: different input file results in incorrect parsing #1