pathwaycom / pathway

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
https://pathway.com

How to Embed each dict in jsonline format #57

Open abdul756 opened 1 month ago

abdul756 commented 1 month ago

I am building a RAG app using llm-app that lists the flight offers available between a source and a destination. When a user asks "please suggest the cheapest flight between source and destination", it should show the fare and all the details of that flight.

I want to calculate the embedding vector of each dict in a JSON Lines file. How can I achieve this?

Sample format `{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T11:30:00", "carrierCode": "UK", "number": "822", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "1", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "2", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T20:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T22:30:00", "carrierCode": "UK", "number": "824", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "2", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "3", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T06:45:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T08:50:00", "carrierCode": "UK", "number": "828", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H5M", "segment_id": "7", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T10:00:00", "carrierCode": "AI", "number": "571", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "8", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "5", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T15:50:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T17:55:00", "carrierCode": "AI", "number": "672", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "9", "numberOfStops": 0, "blacklistedInEU": false} `

dxtrous commented 1 month ago

Hi @abdul756, I'm not at all sure this is a case for vector search, but if you want to do that, you may want to pass "metadata_column" to your chosen indexing approach (https://pathway.com/developers/api-docs/indexing) and use "metadata_filter" in the query so that you can pass hard bounds on times, places, etc.

As for extracting data from JSON elements into columns, this very short guide explains some possible ways - UDF being the most general: https://pathway.com/developers/user-guide/types-in-pathway/json_type
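To make the "hard bounds" idea concrete, here is a minimal sketch in plain Python (it deliberately does not use the Pathway API). It parses JSON Lines records, filters them on route and date, and takes the lowest fare. The helper names `load_offers` and `cheapest` are made up for illustration, and the two records are trimmed copies of the sample in the question.

```python
import json

# Two records copied from the sample above, trimmed to the fields used here.
JSONL = """\
{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM"}
{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM"}
"""


def load_offers(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def cheapest(offers, source, dest, day):
    """Apply hard bounds on route and date, then take the lowest fare."""
    matches = [
        o for o in offers
        if o["departure_iataCode"] == source
        and o["arrival_iataCode"] == dest
        and o["departure_at"].startswith(day)  # ISO timestamps start with the date
    ]
    # default=None avoids an exception when nothing matches the bounds.
    return min(matches, key=lambda o: o["fare_details"], default=None)


offers = load_offers(JSONL)
best = cheapest(offers, "MAA", "BOM", "2024-06-01")
print(best["flight_offer_id"], best["fare_details"])
```

A filter like this always returns the true minimum fare, which is the property a similarity search over embeddings cannot promise.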

abdul756 commented 1 month ago

I will try it and let you know. If I face any problems, please help me.

abdul756 commented 1 month ago

Hi @dxtrous, here is my table:

` | price | itinearies
^6A0QZMJ... | "104.10" | [{"duration": "PT10H", "segments": [{"aircraft": {"code": "321"}, "arrival": {"at": "2024-06-01T14:30:00", "iataCode": "CJB"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T13:20:00", "iataCode": "MAA", "terminal": "4"}, "duration": "PT1H10M", "id": "3", "number": "429", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}, {"aircraft": {"code": "32N"}, "arrival": {"at": "2024-06-01T23:20:00", "iataCode": "BOM", "terminal": "2"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T21:35:00", "iataCode": "CJB"}, "duration": "PT1H45M", "id": "4", "number": "662", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}]}]

^SN0FH7F... | "104.10" | [{"duration": "PT21H35M", "segments": [{"aircraft": {"code": "321"}, "arrival": {"at": "2024-06-01T14:30:00", "iataCode": "CJB"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T13:20:00", "iataCode": "MAA", "terminal": "4"}, "duration": "PT1H10M", "id": "88", "number": "429", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}, {"aircraft": {"code": "32N"}, "arrival": {"at": "2024-06-02T10:55:00", "iataCode": "BOM", "terminal": "2"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-02T09:00:00", "iataCode": "CJB"}, "duration": "PT1H55M", "id": "89", "number": "608", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}]}]

^9KM937R... | "125.11" | [{"duration": "PT1H50M", "segments": [{"aircraft": {"code": "737"}, "arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM", "terminal": "1"}, "blacklistedInEU": false, "carrierCode": "SG", "departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA", "terminal": "1"}, "duration": "PT1H50M", "id": "100", "number": "681", "numberOfStops": 0, "operating": {"carrierCode": "SG"}}]}]`

Now, if a user asks any question related to flights, which index should I use? For example, if a user asks "please get me the details of the cheapest flight" (or the most expensive flight), it should display all the details from the itinearies column based on duration. There are so many indexing algorithms at https://pathway.com/developers/api-docs/indexing. Please help me choose a suitable one for my use case, and explain how the data column and metadata column should be selected for the table I provided.

zxqfd555-pw commented 1 month ago

Hi Abdul,

You may start with the KNN LSH index for a first attempt at indexing. Once you have the whole process up and running, it may make sense to compare different indexes against each other to fine-tune the application.

In the scenario you describe, you will also need an embedder to embed these JSONs containing information about flights. Some of the embedders are provided here, but alternatively you can implement your embedder as a UDF that takes a string or JSON and returns its embedding as a vector of floats. Please note that there is no native embedder here: this task requires you to use a third-party API, like the one from OpenAI.
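For illustration, the shape of such an embedder function (JSON string in, vector of floats out) could look like the sketch below. Note that `toy_embed` is a hypothetical stand-in that only hashes tokens into buckets so the example runs offline; it is not a semantic embedding and not a Pathway or OpenAI API. In a real app the body would call a third-party embedding model, and the function would be registered as a UDF.

```python
import hashlib
import json
import math


def toy_embed(payload, dim=16):
    """Stand-in embedder: JSON string -> unit-length list of floats.

    NOT a semantic embedding. Tokens are hashed into `dim` buckets purely
    to show the input and output types an embedder is expected to have.
    """
    record = json.loads(payload)
    # Serialize keys in sorted order so equal records give equal vectors.
    text = " ".join(f"{key} {record[key]}" for key in sorted(record))
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    # Normalize to unit length; fall back to 1.0 for an empty record.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


vector = toy_embed('{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA"}')
print(len(vector))
```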

Also, as Adrian mentioned above, this case may not be a good fit for vector search. After you have the embeddings and an index that can be queried, there is no guarantee that this index will return the cheapest flight details for the given endpoints and date. While you could probably improve it with a RAG technique, it looks much more like a graph problem where the combination of a source and a timeslot (00:00-01:00, 01:00-02:00, etc.) can be a node, while a flight between two sources can be an edge. Therefore, if the vector search results don't suit you, it makes sense to look at the problem from this angle.
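A rough sketch of the graph view, assuming plain Python and a simplification: states are (airport, exact timestamp) pairs rather than hourly timeslots, and each flight is an edge that may only be taken if it departs at or after the current time. The per-segment fares below are hypothetical (the table in this thread only gives a price per itinerary), and `cheapest_route` is an illustrative name, not a library function.

```python
import heapq

# Edges are (origin, dest, depart, arrive, fare). Times come from the table
# in this thread; the per-segment fares are made up for the example.
FLIGHTS = [
    ("MAA", "CJB", "2024-06-01T13:20:00", "2024-06-01T14:30:00", 40.0),
    ("CJB", "BOM", "2024-06-01T21:35:00", "2024-06-01T23:20:00", 50.0),
    ("MAA", "BOM", "2024-06-01T21:00:00", "2024-06-01T22:50:00", 125.11),
]


def cheapest_route(flights, source, dest, earliest):
    """Dijkstra over (airport, time) states; returns (total_fare, legs) or None."""
    heap = [(0.0, earliest, source, [])]
    seen = set()
    while heap:
        cost, now, airport, legs = heapq.heappop(heap)
        if airport == dest:
            return cost, legs
        if (airport, now) in seen:
            continue
        seen.add((airport, now))
        for origin, target, depart, arrive, fare in flights:
            # ISO 8601 timestamps in the same format compare correctly as strings.
            if origin == airport and depart >= now:
                next_legs = legs + [(origin, target, depart)]
                heapq.heappush(heap, (cost + fare, arrive, target, next_legs))
    return None


result = cheapest_route(FLIGHTS, "MAA", "BOM", "2024-06-01T00:00:00")
print(result)
```

With these made-up fares the connection via CJB is cheaper than the direct flight, and the search finds it exactly, with no dependence on embedding similarity.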

abdul756 commented 1 month ago

@zxqfd555-pw I am using an embedder from OpenAI. For example, I am using [pw.indexing.DataIndex(data_table, inner_index, embedder=None)](https://pathway.com/developers/api-docs/indexing#pathway.stdlib.indexing.DataIndex). I just need to know how to pass the inner index: will it be just price, or will it include itinearies? And how do I use metadata_filter in this case?

zxqfd555-pw commented 1 month ago

The metadata filter would be needed if you index a set of files and would like the index to perform requests only on a specific subset matching a certain pattern. I would say it's not needed for a first attempt at the app.

I would suggest that you pass the embeddings of the full JSON payload. If you pass only the price, that would clearly not be enough to answer the query.
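One possible way to prepare the full payload for an embedder is to flatten each row (price plus the nested itinearies JSON) into a single text string. This is only a sketch in plain Python: `flatten` and `row_to_text` are hypothetical helpers, and the sample itinerary is a trimmed copy of the third row of the table above.

```python
import json


def flatten(value, prefix=""):
    """Yield (path, leaf_value) pairs from nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for i, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{i}]")
    else:
        yield prefix, value


def row_to_text(price, itineraries_json):
    """Combine both columns of a row into one string to hand to an embedder."""
    pairs = [("price", price)]
    pairs.extend(flatten(json.loads(itineraries_json), "itinearies"))
    return "; ".join(f"{path}: {leaf}" for path, leaf in pairs)


# Trimmed copy of the third row of the table in this thread.
ITINERARIES = (
    '[{"duration": "PT1H50M", "segments": [{"carrierCode": "SG", "number": "681", '
    '"departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA"}, '
    '"arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM"}}]}]'
)

text = row_to_text("125.11", ITINERARIES)
print(text)
```

The resulting string keeps the price next to the route and timing details, so a query about fares and a query about routes both have something to match against.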