phac-nml / biohansel

Rapidly subtype microbial genomes using single-nucleotide variant (SNV) subtyping schemes
Apache License 2.0

Remove unnecessary columns from FASTA detailed output #7

Closed: peterk87 closed 7 years ago

peterk87 commented 7 years ago

Remove default BLASTN tabular output columns:

mgopez commented 7 years ago

Out of curiosity, if I remove pident, would that affect https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/__init__.py#L219 ?

peterk87 commented 7 years ago

I wouldn't touch https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/const.py#L3-L19 or any of the other blast_wrapper code. Instead, filter out the columns you don't want from the detailed report, e.g. something like this:

useless_cols = ['pident', 'length', ...]
df = df[df.columns[~df.columns.isin(useless_cols)]]

after https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/subtyper.py#L108

I haven't tested that code, but what it says is: test whether each column name is in your list of useless_cols, negate the resulting bool values with ~, and use the remaining column names to subset df.
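To make the three steps above concrete, here is a minimal sketch on a toy DataFrame. Only pident and length come from this thread; the other column names and all values are made up for illustration and are not the actual bio_hansel report columns.

```python
import pandas as pd

# Toy stand-in for the detailed report. Only 'pident' and 'length' are taken
# from the discussion; the other columns and all values are illustrative.
df = pd.DataFrame({
    "qseqid": ["tile_1", "tile_2"],
    "pident": [100.0, 98.5],
    "length": [33, 33],
    "subtype": ["1.1", "2.2"],
})

useless_cols = ["pident", "length"]

# Step 1: one boolean per column, True where the name is in useless_cols.
is_useless = df.columns.isin(useless_cols)

# Step 2: ~ flips the booleans, so True now marks the columns to keep.
kept_columns = df.columns[~is_useless]

# Step 3: subset the DataFrame with the surviving column names.
filtered = df[kept_columns]
```

Names in useless_cols that are not present in df are simply ignored by isin, so the same list can be applied to reports with different column sets.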

You could also make useless_cols a global variable like https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/const.py#L3 and call it COLUMNS_TO_REMOVE or something better (naming things is hard).

You could also make a list like https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/subtyper.py#L20 but for the detailed report so that the columns are output in an order that makes sense rather than in a semi-random order.

mgopez commented 7 years ago

That makes sense, thanks for the suggestions, Peter. I'll probably try the SUBTYPE_SUMMARY_COLS approach so that a consistent order is maintained.

peterk87 commented 7 years ago

Note that the detailed output from running a genome assembly (FASTA) is slightly different from that for reads (FASTQ), so you may want to append to your DataFrame subsetting list any columns that are present in the DataFrame but not in your column ordering list.
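One way to handle that difference is sketched below in plain Python: take the preferred order first, then append whatever else the DataFrame has. DETAILED_REPORT_COLS, order_columns, and every column name here are hypothetical placeholders, not names from the bio_hansel code.

```python
from typing import Iterable, List

# Hypothetical ordering list for the detailed report (placeholder names).
DETAILED_REPORT_COLS = ["sample", "subtype", "qseqid"]


def order_columns(present: Iterable[str], preferred: List[str]) -> List[str]:
    """Return preferred columns that exist, followed by any leftover columns."""
    present = list(present)
    # Keep the preferred order, skipping columns this report does not have.
    ordered = [col for col in preferred if col in present]
    # Append columns missing from the ordering list, in their existing order.
    extras = [col for col in present if col not in preferred]
    return ordered + extras


# Illustrative column sets for the two input types.
fasta_cols = ["qseqid", "sample", "subtype", "sstart"]
fastq_cols = ["subtype", "sample", "freq"]

fasta_order = order_columns(fasta_cols, DETAILED_REPORT_COLS)
fastq_order = order_columns(fastq_cols, DETAILED_REPORT_COLS)

# With a DataFrame this would be applied as:
#     df = df[order_columns(df.columns, DETAILED_REPORT_COLS)]
```

Because missing columns are skipped rather than looked up directly, the same ordering list works for both FASTA and FASTQ reports without raising a KeyError.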

mgopez commented 7 years ago

Actually, I think I like the

useless_cols = ['pident', 'length', ...]
df = df[df.columns[~df.columns.isin(useless_cols)]]

idea more. I'll be able to filter out columns just for FASTA easily.

mgopez commented 7 years ago

For etc, did you mean a literal column named etc, or columns similar to the ones mentioned above?

peterk87 commented 7 years ago

No, I just used etc as a placeholder for the other BLASTN output columns that are kind of pointless to report. See https://github.com/phac-nml/bio_hansel/pull/12#pullrequestreview-65954555