wbazant / wbazant.github.io

Text processing #14

Open wbazant opened 6 years ago

wbazant commented 6 years ago

Alan's fastas concatenated into a blast db

[wbazant@ebi-cli-002 ~]$ head -n1 /nfs/nobackup/ensemblgenomes/wormbase/parasite/production/jbrowse/WBPS12/tsv/schistosoma_mansoni_prjea36577.tsv | tr '\t' '\n' | cat -n
     1  Study
     2  Submitting centre
     3  Track
     4  Library size (reads)
     5  Mapping quality (reads uniquely mapped)
     6  Results
     7  Developmental stage
     8  Geographic location
     9  Host disease
    10  Isolate
    11  Organism part
    12  Phenotype
    13  Sex
    14  Strain
    15  Timepoint
    16  Treatment

wbazant commented 6 years ago

NCBI format: 80 chars per line, with unpack

perl -E '$/=">"; while(<>){chomp; next unless length; my ($l,@seq) = split "\n"; my $seq = join "", @seq; say ">$l"; for my $seq_line (unpack("(A80)*", $seq)) {say uc($seq_line);} }'

wbazant commented 6 years ago

Pick bits of fasta you need

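perl -MIO::Uncompress::Gunzip -e '
# $ARGV[0]: file of wanted IDs, one per line; $ARGV[1]: gzipped fasta
my %ids;
open(my $ids, "<", $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (<$ids>) { chomp; $ids{$_}++ }
my $fa = IO::Uncompress::Gunzip->new($ARGV[1]) or die "Cannot read $ARGV[1]\n";
$/ = ">";                     # from here on, read one fasta record at a time
while (<$fa>) {
  chomp;                      # drop the trailing ">"
  next unless length;         # skip the empty chunk before the first header
  my ($id) = split "\n";
  $id =~ s/.*gene=(.*)/$1/;   # keep only what follows gene= in the header
  print ">$_" if $ids{$id};
}
'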
wbazant commented 6 years ago

uniq on the first two columns of a TSV file:

perl -e 'my %h; while(<>){my @F = split "\t"; print unless $h{"$F[0]\t$F[1]"}++;}'
wbazant commented 5 years ago

Scaffold lengths

perl -E '
$/=">"; while(<>){
  chomp;               # drop the trailing ">"
  next unless length;  # skip the empty chunk before the first header
  my ($l,@seq) = split "\n";
  my $seq = join "", @seq;
  say ">$l ". length $seq;
}
' strongyloides_papillosus_prjeb525.fa | perl -pe 's/\>//' | tr ' ' $'\t'
wbazant commented 5 years ago

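Print the block of lines from the one starting with a given gene ID to the next line ending in = (https://askubuntu.com/a/849016):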

sed -n '/^WBGene00284906/,/=$/p' t_muris.PRJEB126.WS270.orthologs.txt
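
How the address range behaves, sketched on made-up data. The layout below (one block per gene, closed by a line holding a single "=") is an assumption for illustration and may not match the real orthologs file; extract_block is a hypothetical wrapper that takes the gene ID as its argument.

```shell
# Hypothetical helper: print from the line starting with the given ID
# up to and including the next line that ends in "=".
extract_block() {
  sed -n "/^$1/,/=\$/p"
}

# Made-up stand-in for the orthologs file, passed on stdin.
extract_block WBGene00284906 <<'EOF'
WBGene00000001 first
ortholog A
=
WBGene00284906 second
ortholog B
ortholog C
=
WBGene00000003 third
=
EOF
```

With -n nothing is printed by default, so only the lines inside the range (the second block, including its closing "=") come out.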
wbazant commented 3 years ago
perl -ne '$l=$_ if rand()<(1/$.); END{print $l}'

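This picks a random line from a stream of unknown length in a single pass while storing only one line: line k replaces the current pick with probability 1/k, which leaves every line equally likely to be the final pick. With some more options it will handle a fortune file (or any stream of things that you can go through by record).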