mlcommons / croissant

Croissant is a high-level format for machine learning datasets that brings together four rich layers.
https://mlcommons.org/croissant
Apache License 2.0

All Python functions are annotated #208

Open marcenacp opened 1 year ago

marcenacp commented 1 year ago

Why?

This feature can have several benefits:

  1. It will speed up pytype.
  2. It will make the typing stricter.
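As a rough illustration of what "annotated" means here, the sketch below contrasts an unannotated function with an annotated one and checks which of them carries a full signature. The names `total_size_loose`, `total_size`, and `is_fully_annotated` are hypothetical and are not part of mlcroissant.

```python
import inspect
from typing import Callable, Iterable


# Unannotated: a type checker has to infer the types of `sizes` and the result.
def total_size_loose(sizes):
    return sum(sizes)


# Annotated: the parameter and return types are stated explicitly.
def total_size(sizes: Iterable[int]) -> int:
    return sum(sizes)


def is_fully_annotated(func: Callable[..., object]) -> bool:
    """Return True if every parameter and the return value are annotated."""
    signature = inspect.signature(func)
    params_annotated = all(
        param.annotation is not inspect.Parameter.empty
        for param in signature.parameters.values()
    )
    return_annotated = signature.return_annotation is not inspect.Signature.empty
    return params_annotated and return_annotated
```

A helper like `is_fully_annotated` could be used to list the functions that still lack annotations, although a type checker's own strictness options are probably the more practical route.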

How?

sheyaln commented 1 year ago

Will this need to be assigned to an individual, or can I take a stab at it and make a PR?

marcenacp commented 1 year ago

@Milksheyke Please, feel free to send a PR :) Thanks!

sheyaln commented 1 year ago

I installed the repo and dev tools, but I'm unable to generate the database with monkeytype, at least not fully.

croissant/python/mlcroissant/mlcroissant/_src/operation_graph/execute.py", line 98, in build_record_set
    len(result) == 1
AssertionError: "GroupRecordSet(default)" should have one and only one predecessor. Got: 0.

execute_operations_in_streaming in mlcroissant/_src/operation_graph/execute.py is passing in a DiGraph while build_record_set is expecting a MultiDiGraph. I'm guessing that's where the assertion error is coming from.

Am I missing something here, or should we introduce some sort of default behavior to handle a node that lacks a predecessor?
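To make the two options concrete, here is a minimal sketch of a strict check versus a lenient default. It is not the mlcroissant implementation: the graph is modeled as a plain dictionary rather than the DiGraph/MultiDiGraph types mentioned above, and `single_predecessor`, `predecessor_or_none`, and every node name except "GroupRecordSet(default)" are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for the operation graph: each node maps to the list of
# its predecessors. This is a simplification, not the project's graph type.
PredecessorMap = Dict[str, List[str]]


def single_predecessor(graph: PredecessorMap, node: str) -> str:
    """Strict option: raise unless the node has exactly one predecessor."""
    predecessors = graph.get(node, [])
    if len(predecessors) != 1:
        raise ValueError(
            f'"{node}" should have one and only one predecessor. '
            f"Got: {len(predecessors)}."
        )
    return predecessors[0]


def predecessor_or_none(graph: PredecessorMap, node: str) -> Optional[str]:
    """Lenient option: treat a node without predecessors as a root."""
    predecessors = graph.get(node, [])
    if not predecessors:
        return None
    if len(predecessors) > 1:
        raise ValueError(
            f'"{node}" has {len(predecessors)} predecessors, expected at most one.'
        )
    return predecessors[0]


# Node names other than "GroupRecordSet(default)" are made up for illustration.
example_graph: PredecessorMap = {
    "GroupRecordSet(default)": [],
    "ReadField(hypothetical)": ["GroupRecordSet(default)"],
}
```

With `example_graph`, `single_predecessor` raises for "GroupRecordSet(default)" with the same message as the assertion, while `predecessor_or_none` returns `None`. Whether a silent default is appropriate depends on whether a record set without a predecessor is ever valid, which this sketch does not settle.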