projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

How to read tsv? #368

Closed Hoeze closed 7 months ago

Hoeze commented 3 years ago

I'm trying to run:

glow.transform(
    "pipe",
    input_df.limit(10),
    cmd=json.dumps(shlex.split(cmd)),
    inputFormatter='vcf',
    inVcfHeader='infer',
    outputFormatter='csv',
    out_delimiter="\t"
)

but I still only get a single column. How can I read tsv output?

karenfeng commented 3 years ago

Hi @Hoeze! I just ran a quick test locally, and the CSV piper should still work for TSVs. However, the CSV datasource exposes many options, including whether there is a header (out_header) or there are comments (out_comment). Can you tell me more about the command you're running?

Hoeze commented 3 years ago

@karenfeng The issue that I had was that the option should be called outDelimiter instead of out_delimiter.
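
The rename described here (out_delimiter to outDelimiter) can be applied mechanically before the options are passed on. This is a minimal sketch in plain Python. The helper names snake_to_camel and camelize_options are hypothetical and not part of Glow, and only the outDelimiter spelling is confirmed in this thread:

```python
def snake_to_camel(name: str) -> str:
    """Convert 'out_delimiter' to 'outDelimiter'; names without '_' are unchanged."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def camelize_options(**options):
    """Return a copy of the keyword options with camelCase keys (hypothetical helper)."""
    return {snake_to_camel(key): value for key, value in options.items()}


# Keys that are already camelCase pass through untouched.
opts = camelize_options(outputFormatter="csv", out_delimiter="\t")
print(opts)  # {'outputFormatter': 'csv', 'outDelimiter': '\t'}
```

The resulting dictionary could then be unpacked into the call with **opts, assuming the transformer accepts the camelCase names.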

However, I'm having another issue now. For some reason, the options outNullValue and outEmptyValue do not work:

import json
import shlex

vep_transformed_df = glow.transform(
    "pipe",
    input_df.limit(10).distinct(),
#     cmd=json.dumps(shlex.split("cat | grep -v '^##'")),
    cmd=json.dumps(shlex.split(vep_cmd)),
    inputFormatter='vcf',
    inVcfHeader='infer',
    outputFormatter='csv',
#     outQuote="##",
    outHeader=True,
    outDelimiter="\t",
    outNullValue="-",
    outEmptyValue="-",
)
# A field that should be null still comes back as the placeholder string:
# vep_transformed_df.toPandas()["cDNA_position"].iloc[0]
# '-'

Is this another difference in option naming?

henrydavidge commented 7 months ago

Closing, since we now support only the text piper together with the to_csv/from_csv functions in Spark.
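
With a text-based output, each line arrives as a plain string, so the tab splitting and the "-" placeholder handling asked about above can be done after the fact. The following is a rough sketch using only Python's standard csv module, not Glow or Spark API. The function name parse_tsv_lines and the sample rows are made up for illustration; only the cDNA_position column name comes from this thread:

```python
import csv


def parse_tsv_lines(lines, null_token="-"):
    """Parse tab-separated lines with a header row into a list of dicts.

    Fields exactly equal to null_token become None.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return []
    rows = []
    for fields in reader:
        values = [None if field == null_token else field for field in fields]
        rows.append(dict(zip(header, values)))
    return rows


# Made-up sample lines standing in for the piped tool's output.
sample = [
    "variant\tcDNA_position",
    "var1\t-",
    "var2\t123",
]
rows = parse_tsv_lines(sample)
print(rows[0]["cDNA_position"])  # None
print(rows[1]["cDNA_position"])  # 123
```

Within Spark itself, from_csv accepts a schema and an options dictionary. Whether its separator and null-value options cover this case should be checked against the Spark documentation rather than assumed from this sketch.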