Feature Request - Add DataFrames (Spark or Pandas) as Sources

big-analytics commented 1 year ago

Currently, embedchain allows the addition of various types of data sources such as YouTube videos, PDF files, and web pages to be processed and used in the application. This feature request proposes to extend this functionality to include DataFrames, specifically those from the Spark or Pandas libraries, as potential data sources.

DataFrames are a commonly used data structure for handling and manipulating data in Python, especially in data science and machine learning applications. They are particularly effective when dealing with large, structured datasets, which can include text data.

The ability to use DataFrames as a source of data would add a significant amount of flexibility to embedchain, as users could directly input their preprocessed and transformed data into the application. This could be beneficial in scenarios where the data is already available in a DataFrame format, such as when it has been preprocessed or transformed as part of a larger data pipeline.

The implementation of this feature would involve adding a new method to the App class (or modifying the existing .add() method) that accepts a DataFrame and its format (Spark or Pandas) as arguments. The method would then handle the loading of the data from the DataFrame into the application in the appropriate format, ready to be processed and used in the application.
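A minimal sketch of what such a method could look like, under stated assumptions. The names `dataframe_to_records`, `add_dataframe`, and the `add_text` callback are hypothetical and are not part of embedchain; `add_text` stands in for whatever call the App class uses to ingest a piece of text. The only library calls relied on are `DataFrame.to_dict(orient="records")` for Pandas, and `DataFrame.collect()` with `Row.asDict()` for Spark.

```python
from typing import Any, Callable, Dict, List


def dataframe_to_records(df: Any, fmt: str) -> List[Dict[str, Any]]:
    """Turn a DataFrame into a list of {column: value} dicts, one per row."""
    if fmt == "pandas":
        # pandas: DataFrame.to_dict(orient="records") yields one dict per row.
        return df.to_dict(orient="records")
    if fmt == "spark":
        # PySpark: collect() pulls every row to the driver, so this only
        # suits DataFrames that fit in memory.
        return [row.asDict() for row in df.collect()]
    raise ValueError(f"unsupported DataFrame format: {fmt!r}")


def add_dataframe(add_text: Callable[[str], None], df: Any, fmt: str) -> int:
    """Feed each row to `add_text` as one string; return the row count.

    `add_text` is a hypothetical stand-in for the app's text-ingestion call.
    """
    records = dataframe_to_records(df, fmt)
    for record in records:
        # Assumed serialization: "column: value" pairs joined by commas.
        add_text(", ".join(f"{key}: {value}" for key, value in record.items()))
    return len(records)
```

Passing the format explicitly mirrors the proposal above; detecting it from the object's type would be an alternative, at the cost of importing both libraries.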

This feature would increase the flexibility and usefulness of embedchain, making it more applicable to a wider range of scenarios and use-cases, and potentially attracting a broader user base. It would also align well with common data science workflows, which often involve the use of DataFrames for data manipulation and analysis.

Please consider adding this feature in a future update of embedchain.

cachho commented 1 year ago

With today's addition of local sources, Q&A pairs and presumably text, we've come a lot closer to providing this functionality.

cachho commented 1 year ago

As far as I understand, a DataFrame is just a 2D table. Can we pass that directly to the vector database, or does it have to be a string?

Do you have an example of how you would process that with regular ChatGPT? For my Q&A pair PR, for instance, the input is originally a tuple, but we transform it to match this format: https://platform.openai.com/playground/p/default-qa

If we have to convert to a string, how would you do it?
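One possible approach, as a sketch only: serialize each row into a small "column: value" text block and treat every block as its own document. The function name `row_to_text` and the choice to skip empty cells are assumptions for illustration, not anything embedchain defines. The input is a plain dict per row, which is what pandas returns from `df.to_dict(orient="records")`.

```python
from typing import Any, Dict, List


def row_to_text(record: Dict[str, Any]) -> str:
    """Serialize one row as newline-separated "column: value" pairs.

    Cells that are None or NaN are skipped so empty fields do not add noise.
    """
    lines: List[str] = []
    for column, value in record.items():
        # NaN is the only value that is not equal to itself.
        if value is None or value != value:
            continue
        lines.append(f"{column}: {value}")
    return "\n".join(lines)


# Example rows, shaped like the output of df.to_dict(orient="records").
records = [
    {"title": "Intro", "body": "Hello world", "rating": None},
    {"title": "Usage", "body": "Call add()", "rating": 4.5},
]
documents = [row_to_text(record) for record in records]
```

Keeping the column names in the string gives the embedding some context about what each value means, which a bare comma-separated row would lose.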

taranjeet commented 10 months ago

cc @deshraj
