jpuritz / dDocent

a bash pipeline for RAD sequencing
ddocent.com
MIT License

Very inefficient scan of full reference genome #71

ne1s0n closed 3 years ago

ne1s0n commented 3 years ago

Four times in the code there's a line like

if head -1 reference.fasta | grep -e 'dDocent' reference.fasta 1>/dev/null; then

The intent is to take the first line of the reference FASTA (the head command), which contains the name of the first chromosome/scaffold, and pass it to grep to check whether the file was created by dDocent. However, the grep command also names the reference file, so grep ignores the piped input and scans the entire uncompressed reference, which can easily run to gigabytes of data. The correct command should be

if head -1 reference.fasta | grep -e 'dDocent' 1>/dev/null; then

without the file name. This way grep reads only the single line piped from head, and the unnecessary scan is avoided.
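To make the difference concrete, here is a minimal sketch using a throwaway FASTA file. The temporary directory, the file contents, and the variable names are assumptions made up for this illustration and are not taken from the dDocent script. The first header deliberately lacks the string while a later header contains it, so the two forms give different answers.

```shell
#!/usr/bin/env bash
# Hypothetical demo file: the first header is NOT dDocent-made,
# but a later header happens to contain the searched string.
tmp=$(mktemp -d)
ref="$tmp/reference.fasta"
printf '>chr1\nACGT\n>dDocent_Contig_2\nTTGA\n' > "$ref"

# Original form: grep is given a file operand, so it does not read
# the pipe at all and searches the whole file instead.
if head -1 "$ref" | grep -e 'dDocent' "$ref" 1>/dev/null; then
    original="match"
else
    original="no match"
fi

# Proposed form: grep reads only the single line produced by head.
if head -1 "$ref" | grep -e 'dDocent' 1>/dev/null; then
    fixed="match"
else
    fixed="no match"
fi

echo "original: $original"
echo "fixed: $fixed"
rm -rf "$tmp"
```

On this sample the original form reports a match because of the second header, while the proposed form reports none. So besides the extra I/O, the original form tests a different condition than the one intended.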

This issue is present four times in the code:

FIRST INSTANCE https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L341

SECOND INSTANCE https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L424

THIRD INSTANCE, slightly different in the searched string. The issue remains. https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L341

FOURTH INSTANCE https://github.com/jpuritz/dDocent/blob/9718247b7f533a71057787d77c5232b6b97065c5/dDocent#L424

jpuritz commented 3 years ago

I will happily review and accept a pull request.

jpuritz commented 3 years ago

Fixed