uw-ssec / tutorials

SSEC tutorials for various topics
https://uw-ssec-tutorials.readthedocs.io
BSD 3-Clause "New" or "Revised" License

feat: Add data from full astro-ph arXiv to the RAG based model #16

Closed: vanitech closed this 4 months ago

vanitech commented 7 months ago

Create a separate vectorDB with a larger astro dataset from https://arxiv.org/archive/astro-ph

(Will add test prompts to this issue shortly)

lsetiawan commented 7 months ago

This is related to https://github.com/uw-ssec/tutorials/issues/8

lsetiawan commented 7 months ago

Here's a code snippet to use once you have downloaded the archive.zip file from Kaggle:

import zipfile
import json
import pandas as pd
import io

cols = ['id', 'title', 'abstract', 'categories']

with zipfile.ZipFile('archive.zip') as archive:
    data = []
    # the archive's first member is the JSON metadata file
    json_file = archive.filelist[0]
    with archive.open(json_file) as f:
        # one JSON record per line; decode the byte stream as text
        for line in io.TextIOWrapper(f, encoding="latin-1"):
            doc = json.loads(line)
            lst = [doc['id'], doc['title'], doc['abstract'], doc['categories']]
            data.append(lst)

    df_data = pd.DataFrame(data=data, columns=cols)

# keep only the papers whose category string mentions astro-ph
astro_df = df_data[df_data.categories.str.contains('astro-ph')].reset_index(drop=True)

The code above yields 338658 abstracts from the astro-ph category.
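If holding every arXiv record in memory before filtering is a concern, a streaming variant of the snippet could filter while reading. This is only a sketch: `iter_astro_ph` is a hypothetical helper name, and it assumes the same one-JSON-record-per-line layout and the same substring match on `categories` as above.

```python
import io
import json
import zipfile


def iter_astro_ph(zip_path, member=None, category="astro-ph"):
    """Yield (id, title, abstract, categories) for records matching `category`."""
    with zipfile.ZipFile(zip_path) as archive:
        # default to the first archive member, as in the snippet above
        name = member or archive.namelist()[0]
        with archive.open(name) as f:
            for line in io.TextIOWrapper(f, encoding="latin-1"):
                if not line.strip():
                    continue  # skip blank lines
                doc = json.loads(line)
                # same substring test as str.contains('astro-ph')
                if category in doc["categories"]:
                    yield doc["id"], doc["title"], doc["abstract"], doc["categories"]
```

The filtered rows can then go into a dataframe with `pd.DataFrame(list(iter_astro_ph('archive.zip')), columns=cols)`, so only the astro-ph records are ever kept.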

lsetiawan commented 7 months ago

To help load the abstracts into a vector database supported by LangChain, you can use the dataframe loader on the dataframe above:

from langchain_community.document_loaders import DataFrameLoader

loader = DataFrameLoader(astro_df, page_content_column="abstract")

# this will read the dataframe and make Document objects
# it will load everything into RAM!
documents = loader.load()

# another option so it doesn't load everything into memory at once
# documents = loader.lazy_load()
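Since `lazy_load()` hands documents back one at a time, one way to use it is to group them into fixed-size batches before adding them to a store. The helper below is a minimal sketch in plain Python; `batched`, the batch size of 1000, and the `vector_store` name in the comment are assumptions for illustration, not part of the thread's implementation.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the loader above; `vector_store` stands for
# whichever LangChain vector store is chosen and is not defined here:
# for batch in batched(loader.lazy_load(), 1000):
#     vector_store.add_documents(batch)
```

Only one batch of Document objects is held in memory at a time, at the cost of more calls to the store.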

madhavmk commented 7 months ago

Hello @lsetiawan and @vanitech, I was looking into various document retrieval approaches, and it usually boils down to using either a vector DB or a vector library (blog explaining the difference). Have we decided which of the two we are planning to use?
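For context on the distinction, a vector library is usually understood as little more than an in-memory index with similarity search, whereas a vector DB adds storage and management around it. The sketch below shows only that core search step as a brute-force cosine-similarity lookup in plain Python; the function names are hypothetical and this is not how any particular library implements it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=3):
    """Return the indices of the k vectors most similar to `query`, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

Real libraries replace the exhaustive scan with approximate indexes, but the interface is essentially the same: embeddings in, nearest neighbours out.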

lsetiawan commented 7 months ago

I've implemented the vector DB part. Could you try exploring the vector library?

anantmittal commented 6 months ago

Resources:

lsetiawan commented 4 months ago

Done in https://github.com/uw-ssec/tutorials/commit/87150c98fdea40135f92bde54911038ad9e2b4ef

A rendered version of this can be found at: https://uw-ssec-tutorials.readthedocs.io/en/latest/SciPy2024/appendix/astrophysics-dataset-creation.html