rachelss / SISRS

Site Identification from Short Read Sequences.
Pull Request: RAL_SE_Mapping #35

Closed BobLiterman closed 6 years ago

BobLiterman commented 6 years ago

A note on the conflicts: It appears that @anderspitman also found my README issues, and his changes mirror my own. These were in a previous pull request that wasn't taken up, so feel free to accept whichever version you want.

The majority of this pull request deals with specific_genome.py, which previously consumed a large amount of memory. With single-end (SE) mapping, each directory contains only a single .pileups file, so the script can stream that file line by line rather than load the entire thing into memory.
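For illustration, here is a minimal sketch of what streaming a pileup looks like, assuming the .pileups file uses the tab-separated samtools mpileup layout (sequence name, position, reference base, depth, read bases, qualities). The function name `stream_pileup` and the sample data are hypothetical and are not taken from specific_genome.py.

```python
import io


def stream_pileup(handle):
    """Yield (locus, position, read_bases) one pileup line at a time.

    Hypothetical helper: only the current line is held in memory,
    so peak memory use does not grow with the size of the file.
    """
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or truncated lines
        yield fields[0], int(fields[1]), fields[4]


# Made-up two-line pileup standing in for a real .pileups file.
example = io.StringIO(
    "contig1\t1\tN\t3\tAAa\tIII\n"
    "contig1\t2\tN\t2\tCc\tII\n"
)
rows = list(stream_pileup(example))
```

With a real file, the same generator could be fed from `open(path)` so that lines are consumed as they are read.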

Additionally, previous versions of this script (and some subsequent scripts, which I will amend shortly) only accounted for single-digit indel lengths, whereas real data can contain indels with double-digit lengths. I have amended the code to handle double-digit indels.

'D' is used as a placeholder wherever the pileup file specifies a deletion, consistent with downstream scripts. It seems more appropriate than 'N', which canonically denotes an unknown or ambiguous base rather than an absent one.
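To make the parsing concrete, here is a rough sketch of base cleaning that reads indel lengths of any number of digits and emits 'D' for deletions. It assumes the samtools mpileup read-base encoding (`+<len><seq>` and `-<len><seq>` for indels, `*` for a deleted base, `^` plus a mapping-quality character at read starts, `$` at read ends). The name `clean_bases_sketch`, the `ref_base` parameter, and the choice to map `*` to 'D' are assumptions for illustration, not the actual code in this pull request.

```python
import re

# One or more digits after '+' or '-', so lengths such as 12 are read whole.
INDEL = re.compile(r"[+-](\d+)")


def clean_bases_sketch(read_bases, ref_base="N"):
    """Return a list of base calls parsed from one pileup read-base string."""
    out = []
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2  # skip the caret and the mapping-quality character
        elif ch in "+-":
            match = INDEL.match(read_bases, i)
            if match:
                # Jump past the length digits and the indel sequence itself.
                i = match.end() + int(match.group(1))
            else:
                i += 1
        elif ch == "*":
            out.append("D")  # assumed deletion placeholder
            i += 1
        elif ch in ".,":
            out.append(ref_base.upper())  # match to the reference base
            i += 1
        elif ch.upper() in "ACGTN":
            out.append(ch.upper())
            i += 1
        else:
            i += 1  # '$' and any other symbol carry no base call
    return out


cleaned = clean_bases_sketch("A+12ACGTACGTACGTc*^Ia$")
```

A parser that read only one digit of the length would treat `+12` as a one-base insertion and then miscount the remaining eleven inserted bases as real calls, which is the failure mode the multi-digit handling avoids.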

I have also added a test script (test_getCleanBases.py), which can be run with py.test to check the base parsing.