pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com

Jsonlines connector issue with mapping #4

Open Boburmirzo opened 10 months ago

Boburmirzo commented 10 months ago

Nested data structures in a JSON Lines file cannot be mapped to structured schemas automatically.

For example, mapping list_price and current_price to the schema fails for the following record:

{"position": 1, "link": "https://www.amazon.com/Avia-Resistant-Restaurant-Service-Sneakers/dp/B0BJY1FN8F", "asin": "B0BJXSKK9L", "is_lightning_deal": false, "deal_type": "BEST_DEAL", "is_prime_exclusive": false, "starts_at": "2023-08-14T07:00:08.270Z", "ends_at": "2023-08-21T06:45:08.270Z", "type": "multi_item", "title": "Avia Anchor SR Mesh Slip On Black Non Slip Shoes for Women, Comfortable Water Resistant Womens Food Service Sneakers - Black, Blue, or White Med or Wide Restaurant, Slip Resistant Work Shoes Women", "image": "https://m.media-amazon.com/images/I/3195IpEIRpL._SY500_.jpg", "deal_price": 39.98, "list_price": {"value": 59.98, "currency": "USD", "symbol": "$", "raw": "59.98", "name": "List Price"}, "current_price": {"value": 39.98, "currency": "USD", "symbol": "$", "raw": "39.98", "name": "Current Price"}, "merchant_name": "Galaxy Active", "free_shipping": false, "is_prime": true, "is_map": false, "deal_id": "34f3da97", "seller_id": "A3GMJQO0HY62S", "description": "Avia Anchor SR Mesh Slip On Black Non Slip Shoes for Women, Comfortable Water Resistant Womens Food Service Sneakers - Black, Blue, or White Med or Wide Restaurant, Slip Resistant Work Shoes Women", "rating": 4.16, "ratings_total": 1148, "old_price": 59.98, "currency": "USD"}

With this data schema:

class Price(pw.Schema):
    value: float
    currency: str
    symbol: str
    raw: str
    name: str

class DealResult(pw.Schema):
    position: int
    link: str
    asin: str
    is_lightning_deal: bool
    deal_type: str
    is_prime_exclusive: bool
    starts_at: str
    ends_at: str
    type: str
    title: str
    image: str
    deal_price: Price
    list_price: Price
    current_price: Price
    merchant_name: str
    free_shipping: bool
    is_prime: bool
    is_map: bool

I tried to read the JSON Lines file above:

sales_data = pw.io.jsonlines.read(
    data_dir,
    schema=schema,
    mode="streaming",
    autocommit_duration_ms=50,
)

The error I got:

Read data parsed unsuccessfully. field deal_price with no JsonPointer path specified is absent in
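
The error message refers to JSON Pointer paths (RFC 6901). As a rough illustration in plain Python, independent of how Pathway implements this internally, the sketch below shows how such a path selects a nested value from a record. The helper name resolve_pointer, the shortened sample record, and the column names in the mapping are all hypothetical.

```python
import json


def resolve_pointer(document, pointer):
    """Resolve an RFC 6901 JSON Pointer such as "/list_price/value"."""
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    current = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" becomes "/", then "~0" becomes "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]
        else:
            current = current[token]  # raises KeyError if the field is absent
    return current


# Shortened version of the record from this issue.
line = '{"deal_price": 39.98, "list_price": {"value": 59.98, "currency": "USD"}}'
record = json.loads(line)

# Hypothetical mapping from flat column names to JSON Pointer paths.
paths = {
    "list_price_value": "/list_price/value",
    "list_price_currency": "/list_price/currency",
}
row = {column: resolve_pointer(record, path) for column, path in paths.items()}
```

Here row ends up with one flat entry per mapped column, which is the kind of explicit mapping a nested field needs when no automatic mapping is available.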

If the schema parameter is omitted entirely, Pathway does not automatically map all the fields present in the JSON Lines file.

Expected outcome:

Ideally, the connector would map every field to a table column automatically when I do not specify which fields to extract. If I specify only list_price, the LLM App should create a table containing only list_price.
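
To make the expected behavior concrete, here is a minimal sketch in plain Python of what automatic mapping could look like. It is not Pathway code: the function names flatten and read_rows, the only parameter, and the underscore-joined column naming are assumptions made for illustration.

```python
import json


def flatten(record, parent_key="", sep="_"):
    """Turn nested objects into flat columns, e.g. list_price.value -> list_price_value."""
    columns = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            columns.update(flatten(value, name, sep))
        else:
            columns[name] = value
    return columns


def read_rows(lines, only=None):
    """Yield one flat row per JSON line; `only` restricts the top-level fields kept."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if only is not None:
            record = {key: record[key] for key in only if key in record}
        yield flatten(record)


# Shortened version of the record from this issue.
sample = [
    '{"position": 1, "deal_price": 39.98, '
    '"list_price": {"value": 59.98, "currency": "USD"}}'
]

all_columns = list(read_rows(sample))                        # every field mapped
price_only = list(read_rows(sample, only=["list_price"]))    # only list_price mapped
```

With no selection, every field in the file becomes a column; with only=["list_price"], the resulting rows contain just the columns derived from list_price.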