kbraziun / stat679_notes

course notes to stat 679

tableofSNPs solutions for kristin-braziunas #2

Open kbraziun opened 6 years ago

kbraziun commented 6 years ago

About

I have uploaded two scripts to my git repo with my solutions for modifying tableofSNPs.csv: one performs data cleaning (task) and the other permutes nucleotides (extra task). This issue also includes the code I used to check that each script completed its task successfully. For review by @cecileane and @coraallencoleman.

task: data cleaning

Write a one-liner using sed to remove " and , from the Minimum column.

script

The script is located at scripts/fix_minimums.sh. The script must be run from the main directory. To run, type:

bash scripts/fix_minimums.sh > data/clean/SNPs_clean.csv
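The script itself is not reproduced in this issue. As a rough sketch of the kind of one-liner the task asks for, the function below strips the quotes and the comma from quoted numbers such as "1,234". The function name `strip_quoted_commas`, the sample rows, and the assumption that each quoted value contains a single comma are illustrative only and are not taken from scripts/fix_minimums.sh.

```shell
# Hypothetical helper, not the contents of scripts/fix_minimums.sh.
# Assumes quoted values look like "1,234" (digits, one comma, digits).
strip_quoted_commas() {
  sed -E 's/"([0-9]+),([0-9]+)"/\1\2/g'
}

# Made-up sample rows; the real tableofSNPs.csv columns may differ.
printf '%s\n' 'snp1,"1,234",1300,67' 'snp2,987,1001,15' | strip_quoted_commas
```

Values with more than one comma (for example "1,234,567") would need a looped or repeated substitution, which this sketch does not attempt.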

checking the edited version

To ensure that the script has run correctly, check the edited version with:

gsed -E 's/^[^,]+,[^,]+,[^,]+,[^,]+$/match/' data/clean/SNPs_clean.csv | uniq

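If a row consists of exactly four non-empty fields separated by three commas, it is replaced by the word match. Otherwise, the unedited row is displayed. The uniq command collapses consecutive identical lines, so a fully cleaned file prints a single line: match. Here gsed is GNU sed as typically installed on macOS; on Linux, plain sed is usually GNU sed already.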
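A complementary check, sketched here as an assumption rather than as part of the submitted solution, counts comma-separated fields with awk and prints only the offending rows. The helper name `rows_with_wrong_field_count` is hypothetical, and the expected field count of 4 is inferred from the sed pattern used in this check.

```shell
# Hypothetical helper: print any row that does not have exactly 4
# comma-separated fields, prefixed with its line number.
# Unlike the sed pattern, this does not reject empty fields.
rows_with_wrong_field_count() {
  awk -F, 'NF != 4 { print NR ": " $0 }' "$1"
}

# Usage with the path from this issue (no output means every row has 4 fields):
# rows_with_wrong_field_count data/clean/SNPs_clean.csv
```

Printing line numbers makes it easier to locate a row that still contains a stray comma.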

extra task: nucleotide permutation

Write a one-liner using sed to permute A to T and T to A.

script

The script is located at scripts/permute_nucleotides.sh. The script must be run from the main directory. To run, type:

bash scripts/permute_nucleotides.sh > data/clean/SNPs_permuted.csv

checking the edited version

To ensure that the script has run correctly, there are a few different ways to check the edited version.

Check that the new number of A's matches the original number of T's.

grep -o A data/clean/SNPs_permuted.csv | wc -l
grep -o T data/raw/tableofSNPs.csv | wc -l

Check that the new number of T's matches the original number of A's.

grep -o T data/clean/SNPs_permuted.csv | wc -l
grep -o A data/raw/tableofSNPs.csv | wc -l
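The two count comparisons can also be automated instead of compared by eye. The sketch below is one possible way to do that; the helper names `count_char` and `counts_are_swapped` are hypothetical. Because `grep -o` prints each match on its own line, `wc -l` gives the number of matches.

```shell
# Hypothetical helper: count occurrences of a single character in a file.
# tr removes the padding that some wc implementations add.
count_char() {
  grep -o "$1" "$2" | wc -l | tr -d ' '
}

# Hypothetical helper: succeed only if the A and T counts are exchanged
# between the first file and the second file.
counts_are_swapped() {
  [ "$(count_char A "$1")" = "$(count_char T "$2")" ] &&
    [ "$(count_char T "$1")" = "$(count_char A "$2")" ]
}

# Usage with the paths from this issue:
# counts_are_swapped data/raw/tableofSNPs.csv data/clean/SNPs_permuted.csv && echo "counts match"
```

Matching counts are a necessary sign of a correct swap, not proof of one, which is why a visual inspection is still useful.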

View the first lines of each file to confirm that A's and T's have been swapped.

head data/clean/SNPs_permuted.csv | grep --color '[AT]'
head data/raw/tableofSNPs.csv | grep --color '[AT]'

cecileane commented 6 years ago

awesome!