LangChain task list - Githubissues

pseudotensor commented 1 year ago

Starting from: https://github.com/h2oai/h2ogpt/pull/111

[ ] Add template user can choose/upload to constrain output
[ ] Avoid odd chops of chunks, e.g. a chunk that starts off with ure\n\nquestion\n\nplayed tennis; such a horrible chop will confuse the LLM
[ ] Control hallucinations by 1) checking that rare words from the response exist in the docs, 2) checking 2-grams etc. that should be very rare and shouldn't be in the response if not in the docs, 3) rejecting the LLM response if it has rare 1-grams or 2-grams not matching the sources.
[ ] Improve vector db lookup by 1) creating more metadata, embedding it, and running similarity on it, including example questions that might be asked, 2) creating metadata out of the query, e.g. examples, to better match chunks
[x] Ability to remove file(s) from db https://github.com/hwchase17/langchain/discussions/1690
[x] Be able to choose which document to target for a question, not entire db.
[ ] SQL query generation and access https://github.com/csunny/DB-GPT#sql-generation
[ ] CSV pandas agent
[x] Handle images https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/image.html
[x] Caption images: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/image_captions.html
[ ] Handle ipynb https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/jupyter_notebook.html
[x] Handle arxiv directly: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/arxiv.html
[x] Handle youtube transcripts: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/youtube_transcript.html
[ ] Handle git directly: https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/git.html
[ ] spin/block upload while uploading. Spin add upload too. https://github.com/gradio-app/gradio/issues/3382
[ ] spin add upload while adding https://github.com/gradio-app/gradio/issues/4245
[x] upload zip of docs
[x] avoid upload of duplicate source
[x] joblib or mp for parallel data handling, since each PDF takes a while on a single core. If nested in a zip, tell the child how many procs to use
[x] Show docs in each db in gradio
[x] Add more full integration tests
[x] Add api_name's to langchain gradio parts and add tests
[x] Make URL getting work
[ ] Can ground report more by showing matching words as well as matching n-grams from actual sources
[x] Add ability to do new separate chat and go back to other chats
[x] Add ability to download chat(s).
[x] Consume url from chat itself and do look up
[ ] https://github.com/h2oai/makersaurus
[ ] Make All just mean look at all dbs
[ ] Allow any github link to ingest if not on HF. If on HF, default to h2oGPT and DAI docs.
[ ] Unsure how to work around https://github.com/chroma-core/chroma/issues/412#issuecomment-1547027011
[ ] Good or Bad for Flag, and Compare should be which is better
[ ] NER (named entity recognition) for data extraction
[x] Consider https://github.com/weaviate/weaviate if we want to store both vectors and objects in the db -- not necessarily wanted in general, but it makes the db stable against needing original file locations for links. https://github.com/imartinez/privateGPT/pull/208
[ ] Use ray for parallel embedding etc.: https://www.anyscale.com/blog/llm-open-source-search-engine-langchain-ray
[ ] readthedocs support: https://gist.github.com/waleedkadous/d06097768abbea54d59e5d3ed4f045f3
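
The hallucination-control item in the list above (rare 1-grams and 2-grams in the response that do not appear in the sources) could be prototyped roughly as follows. This is only a sketch under stated assumptions, not code from h2oGPT: the function names, the `common_words` stop list, and the rule that an n-gram is "rare" when none of its words is in that stop list are all hypothetical choices made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unsupported_ngrams(response, sources, common_words, n_values=(1, 2)):
    """List rare n-grams of the response that occur in none of the sources.

    Assumption: an n-gram is "rare" if none of its words is in common_words.
    A real system would more likely use corpus frequencies instead.
    """
    source_tokens = [tokenize(s) for s in sources]
    response_tokens = tokenize(response)
    missing = []
    for n in n_values:
        seen = set()
        for tokens in source_tokens:
            seen.update(ngrams(tokens, n))
        for gram in ngrams(response_tokens, n):
            is_rare = all(word not in common_words for word in gram)
            if is_rare and gram not in seen and gram not in missing:
                missing.append(gram)
    return missing


def accept_response(response, sources, common_words, max_unsupported=0):
    """Reject the response if it has too many unsupported rare n-grams."""
    found = unsupported_ngrams(response, sources, common_words)
    return len(found) <= max_unsupported


# Made-up example data, for illustration only.
sources = ["Alice played tennis on Tuesday."]
common_words = {"on", "the", "a", "in"}
grounded = accept_response("Alice played tennis", sources, common_words)
flagged = unsupported_ngrams("Alice played golf in Paris", sources, common_words)
```

The same list of unsupported n-grams could also feed the "ground report" item, since the n-grams that do match are the ones worth showing back to the user.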

Seems to be using the caption model twice when a long list of files is uploaded to UserData:

Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.36s/it]
/home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:318: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
Overriding torch_dtype=None with `torch_dtype=torch.float16` due to requirements of `bitsandbytes` to enable model loading in mixed int8. Either pass torch_dtype=torch.float16 or don't pass this argument at all to remove this warning.
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.15s/it]

G-force78 commented 1 year ago

I was just about to post that I am not able to select an individual document in the UI; I see you have it on your list to implement. As a hack, I deleted the tmp files under gradio because it kept answering about a machine learning idea, and after that it did answer from the uploaded txt file (I'm using a Colab notebook).

Which interaction method is best for querying uploaded documents? Prompt-answer, summarise, etc.?

pseudotensor commented 1 year ago

Hi @G-force78, yes you can select only an individual document. Just click the x near the drop-down icon so All becomes unselected, then select any number of documents, search by name, etc. It can be just 1 document too.

As for what to put in the chat as prompt, whatever you like! Probably best is question/answer for a human.

G-force78 commented 1 year ago

> Hi @G-force78 , yes you can select only an individual document. Just click x near the drop-down icon so All becomes unselected, then select any number of documents, search by name, etc. Can be 1 document too.
>
> As for what to put in the chat as prompt, whatever you like! Probably best is question/answer for a human.

Sorry, I meant which 'prompt type'? There are quite a few. When I asked it to summarize the document, it basically just repeated it lol

pseudotensor commented 1 year ago

The vector search is very literal right now. If you ask "Summarize this document" that query won't match anything useful in the document selected.

So for now you should ask things like "What is the 'named title' paper about?" etc. Something semi-literal has to match between the query and the document for now.
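
To illustrate why a generic query retrieves nothing useful, here is a toy sketch that ranks chunks by bag-of-words cosine similarity. It is a deliberately crude stand-in: the actual lookup uses embeddings in a vector db, which capture more than exact word overlap, and the function names and example chunks below are made up for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query, chunks):
    """Return (score, chunk) pairs, best match first."""
    query_vec = bag_of_words(query)
    scored = [(cosine(query_vec, bag_of_words(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up document chunks, for illustration only.
chunks = [
    "the tennis match was played on a grass court",
    "rain delayed the final set until evening",
]

generic = rank_chunks("Summarize this document", chunks)
specific = rank_chunks("Who played the tennis match", chunks)
```

In this toy setup the generic query shares no words with any chunk, so every score is zero, while the specific query ranks the tennis chunk first. A real embedding model is less strict than this, but the same tendency explains the behavior described above.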

h2oai / h2ogpt

LangChain task list #134