peterk87 closed this 7 years ago
Curiously, if I remove pident, would that affect this line? https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/__init__.py#L219
I wouldn't touch https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/const.py#L3-L19 or any of the other blast_wrapper code, but rather filter out the columns you don't want from the detailed report, e.g. something like this:
useless_cols = ['pident', 'length', ...]
df = df[df.columns[~df.columns.isin(useless_cols)]]
I haven't tested that code, but what it's saying is: test each column name to see if it's in your list of useless_cols, negate all the bool values with ~, and use the remaining column names to subset df.
You could also make useless_cols a global variable like https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/blast_wrapper/const.py#L3 and call it COLUMNS_TO_REMOVE or something better (naming things is hard).
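For reference, here is a minimal sketch of the column filter described above, using a module-level constant. It has not been run against bio_hansel itself. The COLUMNS_TO_REMOVE name is the one floated in this thread, drop_unwanted_columns is a hypothetical helper, and the toy DataFrame columns other than pident and length are made up for illustration.

```python
import pandas as pd

# Name suggested in the thread. Only 'pident' and 'length' were named there;
# the rest of the list was left open, so it is not filled in here.
COLUMNS_TO_REMOVE = ['pident', 'length']


def drop_unwanted_columns(df, unwanted=COLUMNS_TO_REMOVE):
    """Return df without any column whose name is listed in `unwanted`."""
    # isin() yields one bool per column name; ~ flips it so True means "keep".
    keep_mask = ~df.columns.isin(unwanted)
    return df.loc[:, keep_mask]


# Toy stand-in for the detailed report (made-up columns and values).
report = pd.DataFrame({
    'sample': ['a', 'b'],
    'pident': [100.0, 98.5],
    'length': [33, 33],
    'subtype': ['1.1', '2.2'],
})
trimmed = drop_unwanted_columns(report)
print(list(trimmed.columns))
```

With a reasonably recent pandas, df.drop(columns=COLUMNS_TO_REMOVE, errors='ignore') should give the same result; the mask version just mirrors the snippet above. Either way, names that are missing from the DataFrame are silently skipped.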
You could also make a list like https://github.com/phac-nml/bio_hansel/blob/ddee5c948b40722e2fe11282809fa4b05faf3292/bio_hansel/subtyper.py#L20 but for the detailed report so that the columns are output in an order that makes sense rather than in a semi-random order.
That makes sense, thanks for the suggestions Peter. I'll probably try the SUBTYPE_SUMMARY_COLS approach just so a consistent order is maintained.
Note that the detailed outputs from running a genome assembly (FASTA) vs. reads (FASTQ) are slightly different, so when you subset the DataFrame you may want to append any columns that are present in the DataFrame but missing from your column ordering list.
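One way to sketch the ordering-list idea together with the "append whatever is left" caveat is shown below. DETAILED_REPORT_COLS and order_columns are hypothetical names, and the column names are invented for the example rather than taken from bio_hansel's actual FASTA or FASTQ output.

```python
import pandas as pd

# Hypothetical ordering list, in the spirit of SUBTYPE_SUMMARY_COLS.
# These column names are illustrative assumptions only.
DETAILED_REPORT_COLS = ['sample', 'subtype', 'file_path']


def order_columns(df, preferred=DETAILED_REPORT_COLS):
    """Put preferred columns first, then append any columns not in the list."""
    # Preferred columns that actually exist in this DataFrame, in list order.
    ordered = [c for c in preferred if c in df.columns]
    # Columns the ordering list doesn't know about, kept in their current order.
    extras = [c for c in df.columns if c not in preferred]
    return df[ordered + extras]


# Toy frame that lacks 'file_path' and has one column the list doesn't mention.
toy = pd.DataFrame({'extra_col': [1], 'subtype': ['1.1'], 'sample': ['a']})
result = order_columns(toy)
print(list(result.columns))
```

Because missing preferred columns are skipped and unknown columns are appended, the same list can be reused for both kinds of input without raising a KeyError.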
Actually, I think I like the:
useless_cols = ['pident', 'length', ...]
df = df[df.columns[~df.columns.isin(useless_cols)]]
idea more. I'll be able to filter out columns just for FASTA easily.
For etc, did you mean a literal column named etc, or columns similar to the ones mentioned above?
No, I just used etc as a placeholder for the other BLASTN output columns that are kind of pointless to report. See https://github.com/phac-nml/bio_hansel/pull/12#pullrequestreview-65954555
Remove default BLASTN tabular output columns: