Closed: vanitech closed this issue 4 months ago
This is related to https://github.com/uw-ssec/tutorials/issues/8
Here's a code snippet to use once you have downloaded the archive.zip file from Kaggle:
import io
import json
import zipfile

import pandas as pd

cols = ['id', 'title', 'abstract', 'categories']
with zipfile.ZipFile('archive.zip') as archive:
    data = []
    # the first (and only expected) member of the archive is the JSON file
    json_file = archive.filelist[0]
    with archive.open(json_file) as f:
        # the file holds one JSON document per line
        for line in io.TextIOWrapper(f, encoding="latin-1"):
            doc = json.loads(line)
            lst = [doc['id'], doc['title'], doc['abstract'], doc['categories']]
            data.append(lst)

df_data = pd.DataFrame(data=data, columns=cols)
# keep only the rows whose categories mention astro-ph
astro_df = df_data[df_data.categories.str.contains('astro-ph')].reset_index(drop=True)
The code above yields 338658 abstracts from the astro-ph category.
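If holding every record in memory before filtering turns out to be a problem, one option is to filter while reading. The following is only a sketch under the same assumptions as the snippet above (first archive member is the JSON file, one document per line); `iter_category_records` is a hypothetical helper name, not something from the tutorial repo.

```python
import io
import json
import zipfile


def iter_category_records(zip_path, category="astro-ph", encoding="latin-1"):
    """Yield (id, title, abstract, categories) for records matching `category`."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the first archive member is the JSON Lines metadata file.
        json_file = archive.filelist[0]
        with archive.open(json_file) as f:
            for line in io.TextIOWrapper(f, encoding=encoding):
                if not line.strip():
                    continue  # skip blank lines, if any
                doc = json.loads(line)
                # plain substring match on the space-separated categories string
                if category in doc["categories"]:
                    yield doc["id"], doc["title"], doc["abstract"], doc["categories"]
```

The rows can then be collected with `list(iter_category_records('archive.zip'))` and passed to `pd.DataFrame` with the same `cols` list, so only the matching records are ever kept in the list.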
To help load the abstracts into a vector database supported by LangChain, you can use the DataFrame loader on the dataframe above:
from langchain_community.document_loaders import DataFrameLoader
loader = DataFrameLoader(astro_df, page_content_column="abstract")
# this will read the dataframe and make Document objects
# it will load everything into RAM!
documents = loader.load()
# another option so it doesn't load everything into memory at once
# documents = loader.lazy_load()
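Since `lazy_load()` returns an iterator rather than a list, the documents still need to be consumed in chunks to keep memory bounded. A minimal sketch of that batching step, using only the standard library; `batched` is a hypothetical helper, and the generator of strings below is just a stand-in for the real document iterator. Passing each batch to a vector store (for example via an `add_documents`-style method) is an assumption about how it would be used, not something shown here.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for loader.lazy_load(); any iterator of documents works the same way.
fake_documents = (f"abstract {i}" for i in range(5))
batch_sizes = [len(batch) for batch in batched(fake_documents, batch_size=2)]
print(batch_sizes)  # prints [2, 2, 1]
```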
Hello @lsetiawan and @vanitech, I was looking into various document retrieval approaches, and it usually boils down to using either a vector DB or a vector library (see the blog explaining the difference). Have we decided which of the two we are planning to use?
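For context, the core operation either option provides is nearest-neighbour search over embedding vectors. Below is a toy brute-force sketch of that search with made-up 3-dimensional vectors and hypothetical helper names. Real vector libraries generally replace the linear scan with optimized indexes, and a vector DB typically adds storage and metadata handling on top, so this illustrates only the idea, not either option.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (index, score) pairs most similar to `query`."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings, for illustration only.
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print([index for index, _ in results])  # prints [0, 1]
```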
I've implemented the vector DB part. Could you try exploring the vector library?
Done in https://github.com/uw-ssec/tutorials/commit/87150c98fdea40135f92bde54911038ad9e2b4ef
A rendered version of this can be found at: https://uw-ssec-tutorials.readthedocs.io/en/latest/SciPy2024/appendix/astrophysics-dataset-creation.html
Create a separate vectorDB with a larger astro dataset from https://arxiv.org/archive/astro-ph
(Will add test prompts to this issue shortly)