extract readname from .las

MichelMoser commented 9 years ago

Dear Gene,

I would like to extract the reads (by name) that mapped to a reference genome. Can LAshow print the read names instead of their index numbers (the second column of the .las output)? Or could I just parse the second column of the LAshow output and pipe it to DBshow with the -n option? This works, but I am not sure whether I get the correct results.

Another thing that was bugging me is columns 1 and 3 of the .las output. Could you tell me what the integer in col_1 and the 'n' or 'c' in col_3 stand for?

paxi_mt.50smrtex: 420,488 records
col: [1]   [2  ][3] [                 4                 ]        [   5       ]  [     6       ]
  1         128 c   [ 21,982.. 23,180] x [ 7,448.. 8,710] :   <    125 diffs  ( 13 trace pts)
  1         291 n   [ 22,018.. 23,180] x [ 3,095.. 4,380] :   <    185 diffs  ( 12 trace pts)
  1         386 c   [ 21,969.. 23,167] x [ 4,940.. 6,179] :   <    138 diffs  ( 13 trace pts)
  1         463 n   [ 43,537.. 45,191] x [   323.. 2,249] :   <    327 diffs  ( 17 trace pts)
  1         711 n   [ 21,980.. 23,169] x [ 4,799.. 6,069] :   <    173 diffs  ( 13 trace pts)
  1         775 c   [ 21,976.. 23,180] x [11,411..12,648] :   <    103 diffs  ( 13 trace pts)
  1         785 c   [ 21,968.. 23,163] x [ 1,012.. 2,214] :   <    107 diffs  ( 13 trace pts)

Thank you, Michel

thegenemyers commented 9 years ago

Column 1 is the A read, column 2 is the B read, and column 3 is the orientation (c for complement, n for normal). For example, read 1 has a local alignment with the complement of read 128, a local alignment with read 291, a local alignment with the complement of read 386, and so on. This is also described in my blog post:

https://dazzlerblog.wordpress.com/2014/07/10/dalign-fast-and-sensitive-detection-of-all-pairwise-local-alignments/

Hope that helps, Gene
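
To make the column layout concrete, here is a minimal Python sketch that splits one line of LAshow text output into the A read, the B read, and the orientation. The names `Overlap` and `parse_overlap_line` are illustrative only and are not part of DALIGNER. The comma stripping is an assumption: it supposes read indices may be printed with thousands separators, as the coordinates in the sample output are.

```python
from typing import NamedTuple, Optional


class Overlap(NamedTuple):
    a_read: int        # column 1: index of the A read
    b_read: int        # column 2: index of the B read
    complement: bool   # column 3: True for 'c', False for 'n'


def parse_overlap_line(line: str) -> Optional[Overlap]:
    """Parse the first three columns of one LAshow record line.

    Returns None for lines that are not records (blank lines,
    the '<name>: N records' header, and similar).
    """
    fields = line.split()
    if len(fields) < 3 or fields[2] not in ("c", "n"):
        return None
    try:
        # Assumption: indices may contain thousands separators.
        a_read = int(fields[0].replace(",", ""))
        b_read = int(fields[1].replace(",", ""))
    except ValueError:
        return None
    return Overlap(a_read, b_read, fields[2] == "c")


if __name__ == "__main__":
    sample = ("  1         128 c   [ 21,982.. 23,180] x [ 7,448.. 8,710]"
              " :   <    125 diffs  ( 13 trace pts)")
    print(parse_overlap_line(sample))
```

Only the first three whitespace-separated fields are read, so the bracketed coordinate columns, which contain spaces, do not affect the result.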

MichelMoser commented 9 years ago

Thank you. So to extract the names of the B reads from the .las file, I could indeed do something like:

LAshow A.dam B.db A_B.las | sed 's/\s\+/\t/' - | awk 'NR > 2 {print $2}' | DBshow -n B.db - > B_read_names_which_match_A.txt

thegenemyers commented 9 years ago

I didn't understand the pipe, but it is correct that DBshow -n will give you the headers for each read index. So definitely the correct idea. -- Gene
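
For readers who want the index-extraction step spelled out, the following is a rough Python sketch of what the sed/awk stage is meant to do: collect the B-read indices (column 2) from LAshow text output. The function name `b_read_indices` is hypothetical. De-duplicating and sorting the indices is a choice made here, not something the thread requires, and stripping commas again assumes that large indices may carry thousands separators.

```python
from typing import Iterable, List


def b_read_indices(lines: Iterable[str]) -> List[int]:
    """Return the sorted, de-duplicated B-read indices (column 2)."""
    seen = set()
    for line in lines:
        fields = line.split()
        # Record lines have 'c' or 'n' in column 3; this skips the
        # blank line and the '<name>: N records' header.
        if len(fields) >= 3 and fields[2] in ("c", "n"):
            digits = fields[1].replace(",", "")  # assumption: "1,291" -> "1291"
            if digits.isdigit():
                seen.add(int(digits))
    return sorted(seen)


if __name__ == "__main__":
    sample = [
        "",
        "paxi_mt.50smrtex: 420,488 records",
        "  1         128 c   [ 21,982.. 23,180] x [ 7,448.. 8,710] :   <    125 diffs",
        "  1         291 n   [ 22,018.. 23,180] x [ 3,095.. 4,380] :   <    185 diffs",
        "  2         128 n   [    100..  1,300] x [   200.. 1,450] :   <    140 diffs",
    ]
    for index in b_read_indices(sample):
        print(index)
```

Passing `sys.stdin` instead of the sample list would let the function stand in for the sed and awk stages, with the printed indices then given to DBshow -n as in the pipeline above. A B read that aligns more than once is reported a single time, so each header would be looked up only once.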

thegenemyers / DALIGNER

extract readname from .las #28