timknut / geno_imputation

Documentation and code base for the Geno/Roslin imputation project

markerlist file FinalReport_54kV1_ed1.txt #5

Closed Unoqualsiasi closed 7 years ago

Unoqualsiasi commented 7 years ago

It appears that this file contains 73628 SNPs instead of 54001 as reported in the header of the file -.-

grep '2005' FinalReport_54kV1_ed1.txt | cut -f 1 > FinalReport_54kV1_ed1_markerlist.txt

wc -l FinalReport_54kV1_ed1_markerlist.txt

timknut commented 7 years ago

grep '2005' FinalReport_54kV1_ed1.txt | wc will also match other lines that have 2005 anywhere in them, e.g.:

tikn@login-0:~/for_folk/geno/geno_imputation/genotype_rawdata/illumina54k_v1$ grep '2005' FinalReport_54kV1_ed1.txt | tail -5
Hapmap52005-BTA-75510   5409    A       G       0.9098
Hapmap55117-rs29020058  5409    G       G       0.9003
ARS-BFGL-NGS-42005      5409    C       C       0.8761
BTA-120182-no-rs        5409    G       G       0.2005
Hapmap43172-BTA-120051  5409    G       G       0.8855

Use:

awk '$2==2005 {print $0}' FinalReport_54kV1_ed1.txt | wc -l
54001

Awk is safer, since it operates on columns.

If you want to use grep, use:

grep '\s2005\s' FinalReport_54kV1_ed1.txt | cut -f 1 > FinalReport_54kV1_ed1_markerlist.txt

This way you only match 2005 with whitespace on both sides.
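To make the difference concrete, here is a minimal sketch on a made-up four-line file. The rows are hypothetical: they reuse marker names from the output above, but the sample IDs are assigned for illustration only, and the file is assumed to be tab-separated with the sample ID in column 2.

```shell
# Hypothetical miniature of the report: marker, sample ID, allele 1, allele 2, score.
sample=$(mktemp)
printf 'BTA-120182-no-rs\t5409\tG\tG\t0.2005\n'       >  "$sample"
printf 'ARS-BFGL-NGS-42005\t5409\tC\tC\t0.8761\n'     >> "$sample"
printf 'ARS-BFGL-NGS-42005\t2005\tC\tC\t0.8761\n'     >> "$sample"
printf 'Hapmap43172-BTA-120051\t2005\tG\tG\t0.8855\n' >> "$sample"

# Substring match: counts every line that contains "2005" anywhere.
grep_count=$(grep -c '2005' "$sample")

# Column match: counts only lines whose second field is exactly 2005.
awk_count=$(awk -F'\t' '$2 == 2005 {n++} END {print n+0}' "$sample")

echo "grep: $grep_count  awk: $awk_count"
rm -f "$sample"
```

On this sample, grep counts all four lines (it also hits 0.2005 and 42005 on the rows for sample 5409), while awk counts only the two rows that belong to sample 2005.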

Unoqualsiasi commented 7 years ago

oh fk the boundaries...you are right XD

I was using the awk approach the first time; I don't know why I am using grep now. I think you should update the script prepare_plink_map_example.Rmd with the awk option.

Just a small fix:

awk '$2 == 2005 {print $1}' OFS='\t' FinalReport_54kV1_ed1.txt > output
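A side note on the OFS='\t' part, as a small sketch on a hypothetical one-line file: OFS only comes into play when awk joins several fields on output, so it is harmless but has no effect when only $1 is printed.

```shell
# Hypothetical one-line, tab-separated input for illustration.
demo=$(mktemp)
printf 'ARS-BFGL-NGS-42005\t2005\tC\tC\t0.8761\n' > "$demo"

# Single field: OFS is never used, the output is just the marker name.
one=$(awk '$2 == 2005 {print $1}' OFS='\t' "$demo")

# Two fields separated by a comma: OFS (a tab here) is placed between them.
two=$(awk '$2 == 2005 {print $1, $2}' OFS='\t' "$demo")

printf '%s\n%s\n' "$one" "$two"
rm -f "$demo"
```

So the command works with or without OFS as long as only the marker column is written out.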