Determine how we might use AI functionality in GCP

DeeMcCart commented 2 weeks ago

EPIC: #10

As an AI developer, I want to determine the functionality/capabilities needed to build the AI scope for the project

Assumptions or Pre-Requisites:

Project idea agreed #1
Looking at 'Agent' type AI functionality, which is likely to involve multiple AI tools and possibly a multi-step process
We are assuming at this point that output is likely NOT to include image or video generation
Output MAY include: text, text-to-speech, representations of numeric data (e.g. timelines, bar charts)...

Acceptance Criteria: (Must be completed before task is moved to 'Done')

[x] Must know what we are doing
[ ] Must

Tasks

[x] Task1 24/08/24: Investigate/ explore GCP functionality & documentation to build 'Agent'
[x] Task2 Investigate foundational model build
[x] Task3 Investigate ETL / Pipeline model build
[x] Task4 Attempt model training model1, model2, model3

Before changing task status to 'Review' or 'Done' please provide comment (and screenprints if appropriate) as documentary evidence of task completion

DeeMcCart commented 2 weeks ago

Choices:

Model to use for training (most appropriate to our problem): Supervised Learning vs RLHF
Supervised Learning - use when there is less variability in the data (e.g. for our dataset, this might relate income vs house price to generate the deposit amount required; similarly, for the help-to-buy schemes it might identify which ones the user is eligible for)
RLHF (Reinforcement Learning from Human Feedback)

Number of Epochs: the number of full passes over the training dataset during tuning
Adapter size: relates to the number of trainable parameters in the tuning job. More complex tasks may benefit from larger adapter sizes.
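To make the number of epochs concrete, here is a minimal sketch of a training loop where one epoch is one full pass over the examples. It is a toy gradient-descent fit, not the GCP tuning process; the `train` function, the data and the learning rate are all hypothetical and only illustrate the idea.

```python
def train(examples, epochs, learning_rate=0.01):
    """Fit y = w * x by gradient descent; one epoch = one full pass over examples."""
    w = 0.0
    history = []  # mean squared error recorded once per epoch
    for _ in range(epochs):
        total_loss = 0.0
        for x, y in examples:
            error = w * x - y
            total_loss += error ** 2
            # Gradient of (w*x - y)^2 with respect to w is 2 * error * x
            w -= learning_rate * 2 * error * x
        history.append(total_loss / len(examples))
    return w, history


# Hypothetical toy data following y = 0.1 * x
examples = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (4.0, 0.4)]
w, history = train(examples, epochs=20)
print(f"learned w = {w:.4f}, first loss = {history[0]:.6f}, last loss = {history[-1]:.6f}")
```

More epochs give the model more passes over the same data, so the loss keeps falling here; on a real dataset too many epochs can lead to overfitting.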

Tuning dataset: e.g. JSONL data input, structured as key-value pairs
Model validation: e.g. classification - summarisation - extractive AI - chat
For lovely examples of existing datasets, see Hugging Face datasets
A report Adapt-LLM Finance-chat exists

Needs a data pipeline:
Are we going to scrape from the internet?
Are we going to get structured data from CSVs etc.?
A class to do all the data pre-preparation in one go
Each JSONL file gets a name and a version - train it on, e.g. mortgage calculator.
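A rough sketch of what a single pre-preparation class might look like, assuming structured CSV input. The class name `JsonlDatasetBuilder`, the `name_vN.jsonl` naming scheme, the column names and the sample rows are all hypothetical, and dropping incomplete rows is just one possible cleansing choice.

```python
import csv
import io
import json


class JsonlDatasetBuilder:
    """Hypothetical helper: CSV rows -> cleaned records -> named, versioned JSONL."""

    def __init__(self, name, version):
        self.name = name
        self.version = version

    @property
    def filename(self):
        # Assumed naming scheme: <name>_v<version>.jsonl
        return f"{self.name}_v{self.version}.jsonl"

    def load_csv(self, stream):
        return list(csv.DictReader(stream))

    def clean(self, rows):
        # Keep only rows where every field is a non-empty string
        return [
            row for row in rows
            if all(isinstance(v, str) and v.strip() for v in row.values())
        ]

    def to_jsonl(self, rows):
        # One JSON object per line
        return "".join(json.dumps(row) + "\n" for row in rows)

    def build(self, stream):
        rows = self.clean(self.load_csv(stream))
        return self.filename, self.to_jsonl(rows)


# Hypothetical sample data; the second row has a missing income value
csv_text = "income,house_price\n40000,250000\n,300000\n55000,320000\n"
builder = JsonlDatasetBuilder("mortgage_calculator", 1)
filename, jsonl_text = builder.build(io.StringIO(csv_text))
print(filename)
print(jsonl_text, end="")
```

Keeping load, clean and write steps in one class means each dataset version can be regenerated from its source in one call; a scraping source could be added later as another load method.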
What is the difference between JSON and JSONL? In summary, the key difference is in how they handle multiple JSON objects. Regular JSON files are typically a single, self-contained structure, while JSON Lines use a line-by-line format, allowing for easier streaming and processing of individual objects.
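The difference can be shown with a short sketch using only the Python standard library. The records and their field names (`prompt`, `response`) are hypothetical placeholders, not the format required by any particular tuning service.

```python
import io
import json

# Hypothetical records; field names are illustrative only
records = [
    {"prompt": "example question 1", "response": "example answer 1"},
    {"prompt": "example question 2", "response": "example answer 2"},
]


def to_json(items):
    """Regular JSON: one self-contained structure holding every record."""
    return json.dumps(items, indent=2)


def to_jsonl(items):
    """JSON Lines: one complete JSON object per line."""
    return "".join(json.dumps(item) + "\n" for item in items)


def iter_jsonl(stream):
    """Yield records one at a time, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


json_text = to_json(records)
jsonl_text = to_jsonl(records)

# Regular JSON has to be parsed as a whole before any record is available
all_records = json.loads(json_text)

# JSONL can be consumed line by line, which suits streaming and large files
first_record = next(iter_jsonl(io.StringIO(jsonl_text)))
print(first_record)
```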

Note - Good data is essential - cleansing might be needed (e.g. missing values). Pre-processing is essential (to

DeeMcCart commented 1 week ago

In actual fact the model, dataset integration, and training process took place over multiple time periods during the week leading up to 30th August.
End result was:
1 foundational model per training cycle (incremental training not possible)
Using foundational rather than RAG (although RAG was tested; it can be used to build knowledge 'from the ground up', while foundational model builds on existing knowledge)
Moved this issue to 'done' 02/09/24 as part of Kanban board cleanup

vinnieOrdobas / ci_national_ai