shayprat / Demultiplex

Demux assignment for Bi622

Demultiplexing Pseudocode Feedback #1

Open claire-j-wells opened 1 month ago

claire-j-wells commented 1 month ago

Hi Shayal!

In general, your pseudocode is concise and readable, and I can follow your train of thought easily! One thing I noticed is that you didn't include a way to keep track of statistics in your method. I forgot to add this to my own pseudocode too, which is why it stood out to me!

Another thing to consider: in your game plan, you plan to “check if sequence line of R2.fq and RC_R3 contain any N’s”. This is a pretty sound strategy, but it might be simpler to just check whether each index is in the given list of indexes in the first place, because if it contains an N, it won’t match anything in the list anyway. That way you can combine two lines of code into one.

For your functions, I like the way you’re going with them. It seems like you’ve thought about them a lot more than I have, and I especially like your thought process for generating the reverse complements. However, I was a bit confused because you define index = -1 as a way to start at the last base, but then have index == “A”. I haven’t done much hands-on testing for this function, but you could maybe use a dictionary to do this. In general, your other functions seem reasonable given your plan, and I think you’re on the right track!
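To make the index check and the statistics idea a bit more concrete, here's a rough sketch of what I have in mind. Everything in it is made up for illustration: the index sequences, the `classify_pair` name, and the category labels are my own assumptions, not something from your pseudocode or the assignment.

```python
from collections import Counter

# Hypothetical index set; the real assignment supplies its own list of indexes.
KNOWN_INDEXES = {"GTAGCGTA", "CGATCGAT", "GATCAAGG"}


def classify_pair(index1, index2, known):
    """Label a pair of index reads as 'unknown', 'valid', or 'unmatched'.

    index2 is assumed to be reverse-complemented already.
    """
    # An index containing "N" can never be in the known set, so this single
    # membership test also covers the "contains an N" case.
    if index1 not in known or index2 not in known:
        return "unknown"
    return "valid" if index1 == index2 else "unmatched"


# A Counter is one simple way to collect the statistics as you go.
stats = Counter()
example_pairs = [
    ("GTAGCGTA", "GTAGCGTA"),  # both known and identical
    ("GTAGCGTA", "CGATCGAT"),  # both known but different
    ("GTAGNGTA", "GTAGCGTA"),  # contains an N, so not in the known set
]
for index1, index2 in example_pairs:
    stats[classify_pair(index1, index2, KNOWN_INDEXES)] += 1

print(dict(stats))
```

Using a set rather than a list for the known indexes keeps each lookup fast, which probably matters with files this large.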

Bendycar commented 1 month ago

Hi Shayal,

Everything here looks super solid! I think it will all end up working as intended, but there are a few opportunities for what (I think?) will be pretty significant improvements in efficiency -- which I think is pretty important to think about, given that we're working with 1.5 billion line text files here!

1.) If I'm not misunderstanding you, it looks like your game plan is to loop through the files twice -- once to create the reverse complement of R3, append the indices to the header and then create new fastq files ("R1.R2.RC_R3.fq"), then again to check if they're unmatched / unknown / valid. I think it would be much faster to make sure you're just looping through the original files once, rather than creating a whole new file to loop through a second time.

2.) On a similar note, it looks like your reverse complement function is set up to open the file and iterate through the entire thing. If you're going to change to the strategy I described earlier, I think you should restructure this a little bit to just take a string as input (the sequence line, which you could extract in your main code), rather than taking the whole file as input and extracting the sequence lines within the function. I also agree with what Claire said about this function -- using a dictionary (with the keys as each base and value as their complement, something like {"A" : "T", "G" : "C", etc}) would be much faster than your current method.
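Here's a minimal sketch of how points 1 and 2 could fit together: a reverse complement function that takes a string, used inside a single pass over the input files. This is only one way to do it, and several things are assumptions on my part: the function names, the header format with the indexes appended, and the tiny in-memory "files" standing in for R1, R2, and R3.

```python
import io
from itertools import islice

# Base-to-complement lookup; "N" maps to itself so unknown bases survive.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def read_records(handle):
    """Yield (header, sequence, plus, quality) tuples from an open FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:  # end of file (or a truncated final record)
            return
        yield tuple(record)


def annotate_headers(r1, r2, r3):
    """Walk all three files once, yielding each R1 header with both indexes appended."""
    for rec1, rec2, rec3 in zip(read_records(r1), read_records(r2), read_records(r3)):
        index1 = rec2[1]
        index2 = reverse_complement(rec3[1])  # done per record, no intermediate file
        yield f"{rec1[0]} {index1}-{index2}"


# Tiny in-memory stand-ins for the real files (hypothetical data).
r1 = io.StringIO("@read1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@read1\nAACC\n+\nIIII\n")
r3 = io.StringIO("@read1\nGGTT\n+\nIIII\n")

headers = list(annotate_headers(r1, r2, r3))
print(headers)  # ['@read1 AACC-AACC']
```

The unmatched / unknown / valid check could then happen inside the same loop, so each record is read exactly once.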

All in all, I think this is a great start - I just strongly suspect based on things Leslie has said that efficiency is going to be important for this project, so try to think about how to optimize wherever you can!

-Ben