single nucleotide error log

dlaehnemann commented 6 years ago

For each input file, write out a log file with a similar name and some sort of suffix like .mdasim-copy-errors.log which documents the introduction of each single nucleotide error. I think, this should be some sort of tab separated file with:

#seq   pos     orig-nt  err-nt
chr17  175353  T        G

VSack commented 6 years ago

I don't think that it is a good idea to include #seq into this. You can give mdasim any kind of fasta sequence (if it's long enough), if incorporating the sequence name makes sense at all, I'd put it in the name of the log file to distinguish between log files of different runs. In that case, the sequence name would have to be read from the input file or the cmd line. However I might have misunderstood your intention of having that #seq column.

Also, we should think about how to specify the position of the error exactly - relative to the entire length or relative to the sequences in the output file:

>485 | name = R507 | fragment = 242 | position = 31310 
AAGGCAGGGAATGCCCTGGGAAGGAGACTCCAGTTCCACACAGGCCTCCTGCAGAA ...
>27094 | name = R508 | fragment = 243 | position = 27094 
AGCATGAGGAAGCTGCTCATGGACCAAGTGCTCCCCCTCGGTGCAATTTAAATTTG...

As the README_mdasim1-2.txt describes:

The first two values of the ID line for each amplicon is the length of the amplicon and the name of the amplicon. The name of the amplicon is in the format of R(Index of the amplicon). Indexes of amplicons start from 1 and increase for each amplicon by one. The last one shows the total number of amplicons in the file. The last two values in each line are mostly used for debugging the program. The number that comes after the term "fragment = " is the ID of the fragment that the respective amplicon comes from. In fact, as it is described in [1], the amplicons are generated by cleaving fragments from the positions the double- and single-starnded regions meet (look at 3rd step of the algorithm in [1]). The number that comes after the term "position = " is the position of the last nucleotide of the amplicon in the respective fragment.

Whichever way we chose, from a first glance at the source code, it won't be a simple two-line-change to get those values to write them into the log file.

dlaehnemann commented 6 years ago

Hmm, OK. Then maybe this is something to deduct otherwise from the wrapping pipeline. Let's discuss this tomorrow and keep this issue open for now. Thanks for putting some thought into it!

dlaehnemann commented 6 years ago

OK, after discussing this in the videocall, here's what I would go for:

Create a command line option, that when given a file name, will write out at least the following tab-delimited log information on error:

#reference  substitution
A           G
A           T

If any more information, like the name of the fragment on which the error is introduced or some kind of other positional information is available, it might be worth writing that out, as well. Please check and consider this, but if in doubt, we can discuss it again.

Even with the basic output, we can at least compare the distribution of substitutions we get from this log output to the distributions we get from our output testing pipeline in issue #2.

dlaehnemann commented 6 years ago

Thanks for implementing this. As discussed in the videocall, I'd explicitly name the log-file, as this gives us more flexibility -- and that should be about two lines of code change... ;)

hzi-bifo / mdasim

single nucleotide error log #1