projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0
263 stars 110 forks

quarantine functionality in pipe transformer writes out full dataset #463

Closed (williambrandler closed 5 months ago)

williambrandler commented 2 years ago

The new quarantine functionality in the pipe transformer runs successfully when there are corrupted records; however, it writes the full dataset to the quarantine table rather than only the corrupted records.

As an example, I created an input dataframe of 961 rows. The expected counts are 500 rows in the output dataframe and 461 rows in the quarantine table. What I actually see is 500 and 961, respectively. It seems like all of the input data is piped out into the quarantine table.
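To make the expected behaviour concrete, here is a minimal sketch in plain Python (no Spark or Glow involved). The names `split_quarantine` and `counts_are_consistent` and the `corrupted` flag are hypothetical, invented for illustration; they are not part of the Glow API. The sketch assumes every input row lands in exactly one of the two outputs, which is what the expected counts above imply.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def split_quarantine(
    records: List[T], is_corrupted: Callable[[T], bool]
) -> Tuple[List[T], List[T]]:
    """Partition records into (passed, quarantined); each record goes to exactly one list."""
    passed: List[T] = []
    quarantined: List[T] = []
    for record in records:
        if is_corrupted(record):
            quarantined.append(record)
        else:
            passed.append(record)
    return passed, quarantined


def counts_are_consistent(input_count: int, output_count: int, quarantine_count: int) -> bool:
    """True when the output and quarantine counts add up to the input count."""
    return output_count + quarantine_count == input_count


# Hypothetical stand-in data: 961 rows, of which the last 461 are flagged as corrupted.
rows = [{"id": i, "corrupted": i >= 500} for i in range(961)]
passed, quarantined = split_quarantine(rows, lambda r: r["corrupted"])

# Expected behaviour: the two outputs partition the input.
assert counts_are_consistent(len(rows), len(passed), len(quarantined))

# Counts reported in this issue: the quarantine table holds the full input.
assert not counts_are_consistent(961, 500, 961)
```

A check like `counts_are_consistent` could be applied to the real output and quarantine table counts as a quick regression test for this behaviour.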

See screenshots below. Code to reproduce can be found here: https://github.com/projectglow/glow/pull/461

[Screenshot: Screen Shot 2021-12-14 at 11 53 08 AM]
[Screenshot: Screen Shot 2021-12-14 at 11 55 37 AM]

mah-databricks commented 2 years ago

Issue verified. Working on a fix.

mah-databricks commented 2 years ago

This is proving more complex than originally thought. It requires a rewrite of some of the internals of the Piper.

williambrandler commented 2 years ago

c'est la vie!

ok, thanks for the update.

How much effort do you think it will be?

mah-databricks commented 2 years ago

https://github.com/projectglow/glow/pull/469

williambrandler commented 2 years ago

thanks, this looks good