valenlab / amplican


Report error and question about normalization with control sample #7

Closed misik closed 2 years ago

misik commented 4 years ago

Hi, I am using the Docker image for amplican:

    docker pull quay.io/biocontainers/bioconductor-amplican:1.4.0--r351_0

I have a question about normalization. Your paper suggests that normalization is done using control samples, but I still see insertion/deletion rates for the control samples in the events_filtered_shifted_normalized.csv file. Can you explain how normalization works with the controls?

[image]

ID24, ID25, ID26 and ID27 are the controls in this figure.

Another question: I consistently get the following error:

    label: plot read heterogeneity (with options)
    List of 5
     $ echo      : logi FALSE
     $ fig.height: language height + 1
     $ fig.width : num 14
     $ message   : symbol F
     $ warning   : logi FALSE

    |......................................................... | 88%
    ordinary text without R code

    |............................................................. | 94%
    label: plot_alignments (with options)
    List of 4
     $ results: chr "asis"
     $ echo   : symbol F
     $ message: symbol F
     $ warning: symbol F

    Quitting from lines 12-13 (amplicon_report.Rmd)
    Quitting from lines 50-51 (amplicon_report.Rmd)
    Error: Unequal parameter lengths: x (187), label (150), colour (150)
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

How can I solve this issue?

JokingHero commented 4 years ago

Hi,

About your error: the current version is 1.6.2 and is available on Bioconductor. Please use that version, since it is the one that is supported. I can't help with the problem in your version, but you can try editing the amplicon_report.Rmd file yourself to fix it.

Normalization is explained extensively in the supplement to the paper. It is also explained in the package vignettes.

As for why you still see events for the control samples: why not? You know which samples are controls, so you can easily filter them out if you need to. Only the non-control samples are normalized; the control samples are left as they were originally, so that you can see for yourself what was in the control.
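For readers who want to drop the control rows themselves, here is a minimal Python sketch of that filtering step. It is only an illustration: the inline table is made up, and the column names `seqnames`, `type` and `counts` are assumptions about the layout of events_filtered_shifted_normalized.csv, so check them against your own file. The control IDs are the ones named in the question.

```python
import csv
import io

# Hypothetical excerpt of an events table. The column names ("seqnames" for the
# experiment ID, "type", "counts") and all values are assumptions for this sketch.
EVENTS_CSV = """seqnames,type,counts
ID23,deletion,120
ID24,deletion,4
ID25,insertion,2
ID28,insertion,35
"""

# Control IDs as named in the question above.
CONTROL_IDS = {"ID24", "ID25", "ID26", "ID27"}


def drop_controls(csv_text, control_ids, id_column="seqnames"):
    """Return the rows whose experiment ID is not one of the control IDs."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row for row in rows if row[id_column] not in control_ids]


treated = drop_controls(EVENTS_CSV, CONTROL_IDS)
for row in treated:
    print(row["seqnames"], row["type"], row["counts"])
```

To run this on a real file, replace the inline string with the file contents and adjust `id_column` to whatever column holds the experiment ID.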

misik commented 4 years ago

Hi,

I installed version 1.16 using docker and ran into the same error:

    label: plot_alignments (with options)
    List of 4
     $ results: chr "asis"
     $ echo   : symbol F
     $ message: symbol F
     $ warning: symbol F

    Quitting from lines 12-13 (amplicon_report.Rmd)
    Quitting from lines 50-51 (amplicon_report.Rmd)
    Error: Unequal parameter lengths: x (187), label (150), colour (150)
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

How can this be solved?

Also, is there a results file that mentions the final editing rates (percent editing) using normalized values? Are the numbers in config_summary.csv normalized read numbers?

Thank you, Meltem

misik commented 4 years ago

I noticed that many of the reads remain unassigned even if they have the guide sequence or at least one of the primers in them. What can be the reason for this?

    Config file: /amplican_config/g37_sequences_config_final.txt
    Average Quality: 25
    Minimum Quality: 0
    Write Alignments: txt
    Fastq files Mode: 0.5
    Gap Opening: 25
    Gap Extension: 0
    Consensus: TRUE
    Normalize: guideRNA, Group
    PRIMER DIMER buffer: 30
    Cut buffer: 5
    Scoring Matrix:
    ,A,C,G,T
    A,5,-4,-4,-4
    C,-4,5,-4,-4
    G,-4,-4,5,-4
    T,-4,-4,-4,5

    Barcode,experiment_count,read_count,bad_base_quality,bad_average_quality,bad_alphabet,filtered_read_count,unique_reads,unassigned_reads,assigned_reads
    g37,11,1876189,0,8526,0,1867663,320900,265642,55258
    g37_Ctrl,11,1609852,0,8752,0,1601100,279031,233430,45601

JokingHero commented 4 years ago

Hi again,

Final editing rates can be calculated from config_summary.csv by dividing, e.g., HDR/Reads_Filtered or Reads_Edited/Reads_Filtered. This is described in the documentation. You can also see how the plots are made in the .Rmd files.
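The division described above can be sketched in a few lines of Python. The column names `Reads_Filtered`, `Reads_Edited` and `HDR` come from this thread; the `ID` values and all counts below are made up for illustration.

```python
def percent_of_filtered(count, reads_filtered):
    """Return count as a percentage of Reads_Filtered (0.0 when there are no reads)."""
    if reads_filtered == 0:
        return 0.0
    return 100.0 * count / reads_filtered


# Hypothetical config_summary rows; IDs and numbers are invented for this sketch.
summary_rows = [
    {"ID": "ID_1", "Reads_Filtered": 2000, "Reads_Edited": 500, "HDR": 100},
    {"ID": "ID_2", "Reads_Filtered": 1000, "Reads_Edited": 10, "HDR": 0},
    {"ID": "ID_3", "Reads_Filtered": 0, "Reads_Edited": 0, "HDR": 0},
]

rates = {
    row["ID"]: {
        "edited_pct": percent_of_filtered(row["Reads_Edited"], row["Reads_Filtered"]),
        "hdr_pct": percent_of_filtered(row["HDR"], row["Reads_Filtered"]),
    }
    for row in summary_rows
}

for exp_id, r in rates.items():
    print(exp_id, r["edited_pct"], r["hdr_pct"])
```

The guard for zero `Reads_Filtered` is a design choice of this sketch, so that experiments with no assigned reads report 0% instead of raising a division error.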

Run ?amplicanPipeline to see what can be done about your problem with read assignment. There is a primer_mismatch parameter; increasing it to two mismatches could help, but the pipeline will probably run much slower. You also set fastqfiles to 0.5, which should help as well.
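To illustrate why tolerating mismatches can rescue reads, here is a small Python sketch that matches a primer against the start of each read with a configurable mismatch budget. This is not ampliCan's implementation: the function names, the primer and the reads are all hypothetical, and the real pipeline may locate primers differently.

```python
def prefix_mismatches(read, primer):
    """Count mismatches between the primer and the same-length start of the read.

    Returns None when the read is shorter than the primer.
    """
    if len(read) < len(primer):
        return None
    return sum(1 for r, p in zip(read, primer) if r != p)


def starts_with_primer(read, primer, max_mismatch=0):
    """True if the read starts with the primer, allowing up to max_mismatch mismatches."""
    mismatches = prefix_mismatches(read, primer)
    return mismatches is not None and mismatches <= max_mismatch


# Hypothetical primer and reads, invented for this sketch.
PRIMER = "ACGTACGT"
READS = [
    "ACGTACGTTTGA",  # exact primer match
    "ACGAACGTTTGA",  # one mismatch in the primer region
    "ACGAACCTTTGA",  # two mismatches in the primer region
    "TTTTTTTTTTGA",  # unrelated read
    "ACGT",          # too short to contain the primer
]

for allowed in (0, 1, 2):
    assigned = [r for r in READS if starts_with_primer(r, PRIMER, allowed)]
    print(allowed, len(assigned))
```

Each extra allowed mismatch admits more reads, at the price of more comparisons and a higher chance of assigning a read to the wrong experiment.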

Do you still get the error in the amplicon_report? I could try to debug it if I get the input files for the amplicon_report from you: amplicon_report.Rmd, events_filtered_shifted_normalized.csv and config_summary.csv. These files might be quite big, but if you can upload them somewhere for me I will fix it ASAP.

misik commented 4 years ago

Hi, the problem with the reports is solved. However, I still see unassigned reads even though the guide sequence and the primer sequence match the read completely, without any mismatches.

Also, the total read numbers reported in barcode_reads_filters.csv, config_summary.csv and events_filtered_shifted_normalized.csv do not match. There is no description of what these numbers refer to or how they are calculated. Can you please write descriptions for the column names in these final files?

JokingHero commented 4 years ago

These files are for low-level processing. Using them directly requires some knowledge of ampliCan and some knowledge of programming.

Why not rely on the HTML reports? The barcode report lists the most frequent unassigned reads, additionally aligned forward to reverse so that you can see the overlap. There you can see why no primer was matched to a read. If those reads really do contain the primers, that would mean there is a bug, but it's unlikely, since matching reads to barcodes does work in some cases.

If you really want to use the raw ampliCan files, I am afraid you will have to spend some time reading. All three files are described in the vignette. To see how those files are processed, look at the code inside the ".Rmd" files that generate the HTML reports. You can also read the help pages for the individual functions in R.

Put simply, the "barcode_reads_filters.csv" file shows only barcode-level information. The filtered_read_count from that file is the number of reads distributed over the Reads column of "config_summary.csv" for that specific barcode, e.g. if you sum up all Reads for the g37 barcode, the total should equal the filtered_read_count for this barcode. Next, the "events_filtered_shifted_normalized.csv" file contains all the events. They are summarized into different fields using the amplicanSummarize function: Reads_In | Reads_Del | Reads_Edited | Reads_Frameshifted. The other fields are what their names suggest: Reads_In is the count of reads with insertions, bad_base_quality is the count of reads with bad base quality, etc.
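The relationships between these files can be checked with a short Python sketch. The two barcode rows are copied from the barcode_reads_filters.csv excerpt posted earlier in this thread. The two arithmetic checks (filter counts and assigned plus unassigned reads) are inferred from those posted numbers, not taken from the documentation, so treat them as observations about this data set. The `SUMMARY` rows, including the "toy" barcode, are made up to show the per-barcode sum described above.

```python
# Rows copied from the barcode_reads_filters.csv excerpt posted in this thread.
BARCODES = [
    {"Barcode": "g37", "read_count": 1876189, "bad_base_quality": 0,
     "bad_average_quality": 8526, "bad_alphabet": 0,
     "filtered_read_count": 1867663, "unique_reads": 320900,
     "unassigned_reads": 265642, "assigned_reads": 55258},
    {"Barcode": "g37_Ctrl", "read_count": 1609852, "bad_base_quality": 0,
     "bad_average_quality": 8752, "bad_alphabet": 0,
     "filtered_read_count": 1601100, "unique_reads": 279031,
     "unassigned_reads": 233430, "assigned_reads": 45601},
]


def check_barcode_row(row):
    """Return named True/False checks for one barcode row."""
    removed = (row["bad_base_quality"] + row["bad_average_quality"]
               + row["bad_alphabet"])
    return {
        # Reads left after quality filtering (observed in the posted rows).
        "filters_add_up": row["read_count"] - removed == row["filtered_read_count"],
        # Unique reads split into assigned and unassigned (observed as well).
        "assignment_adds_up": (row["assigned_reads"] + row["unassigned_reads"]
                               == row["unique_reads"]),
    }


def unassigned_fraction(row):
    """Fraction of unique reads that could not be assigned to any experiment."""
    return row["unassigned_reads"] / row["unique_reads"]


def reads_match_barcode(summary_rows, barcode, filtered_read_count):
    """Check that Reads summed over a barcode equals its filtered_read_count."""
    total = sum(r["Reads"] for r in summary_rows if r["Barcode"] == barcode)
    return total == filtered_read_count


# Hypothetical config_summary rows; IDs, barcode name and counts are invented.
SUMMARY = [
    {"ID": "ID_1", "Barcode": "toy", "Reads": 600},
    {"ID": "ID_2", "Barcode": "toy", "Reads": 400},
]

for row in BARCODES:
    print(row["Barcode"], check_barcode_row(row), round(unassigned_fraction(row), 3))
print(reads_match_barcode(SUMMARY, "toy", 1000))
```

In the posted rows the unassigned reads make up well over half of the unique reads for both barcodes, which is the imbalance discussed in this thread.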

Such a large number of unassigned reads suggests that there is some issue with your primers or with the sequencing itself. Check the barcode report for this.