kurtzace / diary2023


AI Notes #13

Open kurtzace opened 9 months ago

kurtzace commented 9 months ago

ANN image

3d plot image

Part 2

AI ML Notes

Which model to choose image

logit/class in tf image

image

mnist class

CNN

image generator image

rnn auto reg

Sequence data

Building data for any sequential model (AR/RNN/GRU/LSTM): standardize first, then (image); see the sketch below.

image
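A minimal sketch (my assumption of the usual approach, not from the notes): after standardizing the series, slice it into fixed-length windows with the next value as the target, which works as input for any RNN/GRU/LSTM.

```python
import numpy as np

def make_windows(series, window=10):
    """Return X with shape (n, window, 1) and y with shape (n,)."""
    series = (series - series.mean()) / series.std()  # standardize first
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return np.array(X)[..., np.newaxis], np.array(y)

X, y = make_windows(np.sin(np.arange(200, dtype=float)))
```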

simple rnn

(images)

LSTM equations (image); f_c and f_h are generally tanh.

If you return all hidden states, try a max pool over time.
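A minimal Keras sketch of that idea (input shape and layer sizes are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 16)),           # (timesteps, features)
    tf.keras.layers.LSTM(32, return_sequences=True),   # return all hidden states
    tf.keras.layers.GlobalMaxPooling1D(),               # max pool over time
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```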

(images)

CNN for text

image


https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb

- Noun (N): Daniel, London, table, dog, teacher, pen, city, happiness, hope
- Verb (V): go, speak, run, eat, play, live, walk, have, like, are, is
- Adjective (ADJ): big, happy, green, young, fun, crazy, three
- Adverb (ADV): slowly, quietly, very, always, never, too, well, tomorrow
- Preposition (P): at, on, in, from, with, near, between, about, under
- Conjunction (CON): and, or, but, because, so, yet, unless, since, if
- Pronoun (PRO): I, you, we, they, he, she, it, me, us, them, him, her, this
- Interjection (INT): Ouch! Wow! Great! Help! Oh! Hey! Hi

```python
import nltk

# Chunk grammar: optional determiner, any number of adjectives, then a noun
grammar = r"NP: {<DT>?<JJ>*<NN>}   # NP"

chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)
```


```python
from transforms import filter_insignificant  # transforms: helper module from the NLTK Cookbook examples

filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
# [('terrible', 'JJ'), ('movie', 'NN')]
```


Text Classification: Binary or multiclass classification

```python
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)

def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)
```

naive bayes: P(label | features) = P(label) * P(features | label) / P(features)

```python
from nltk.classify import NaiveBayesClassifier
from featx import bag_of_words  # featx: feature-extraction module from the NLTK Cookbook examples

nb_classifier = NaiveBayesClassifier.train(train_feats)
nb_classifier.labels()            # ['neg', 'pos']

negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
nb_classifier.classify(negfeat)
```

```python
from nltk.classify import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True,
                                             entropy_cutoff=0.8,
                                             depth_cutoff=5,
                                             support_cutoff=30)
accuracy(dt_classifier, test_feats)   # 0.688
```

The third classifier we will cover is the MaxentClassifier class, also known as a conditional exponential classifier or logistic regression classifier. The maximum entropy classifier converts labeled feature sets to vectors using encoding. This encoded vector is then used to calculate weights for each feature that can then be combined to determine the most likely label for a feature set. For more details on the math behind this, see https://en.wikipedia.org/wiki/Maximum_entropy_classifier.

```python
me_classifier = MaxentClassifier.train(train_feats, algorithm='gis',
                                       trace=0, max_iter=10, min_lldelta=0.5)
accuracy(me_classifier, test_feats)   # 0.722
```

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

sk_classifier = SklearnClassifier(MultinomialNB())
sk_classifier.train(train_feats)
# <SklearnClassifier(MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))>
```

```python
# Adapted from NLTK's nltk.classify.scikitlearn source
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from nltk.classify.api import ClassifierI
from nltk.probability import DictionaryProbDist
from nltk import compat

class SklearnClassifier(ClassifierI):
    def __init__(self, estimator, dtype=float, sparse=True):
        self._clf = estimator
        self._encoder = LabelEncoder()
        self._vectorizer = DictVectorizer(dtype=dtype, sparse=sparse)

    def batch_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        classes = self._encoder.classes_
        return [classes[i] for i in self._clf.predict(X)]

    def batch_prob_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        y_proba_list = self._clf.predict_proba(X)
        return [self._make_probdist(y_proba) for y_proba in y_proba_list]

    def labels(self):
        return list(self._encoder.classes_)

    def train(self, labeled_featuresets):
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        self._clf.fit(X, y)
        return self

    def _make_probdist(self, y_proba):
        classes = self._encoder.classes_
        return DictionaryProbDist(dict((classes[i], p)
                                       for i, p in enumerate(y_proba)))
```

```python
from sklearn.svm import SVC

sk_classifier = SklearnClassifier(SVC())
sk_classifier.train(train_feats)
# <SklearnClassifier(SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#   degree=3, gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
#   random_state=None, shrinking=True, tol=0.001, verbose=False))>
accuracy(sk_classifier, test_feats)   # 0.69
```

Precision = TP / (TP + FP), Recall = TP / (TP + FN)

|        | Expected + | Expected - |
|--------|------------|------------|
| Pred + | TP         | FP         |
| Pred - | FN         | TN         |
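An illustrative sketch (toy labels, not from the notes) computing the same metrics with scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))   # rows = true labels, cols = predicted
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
```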

```python
from sklearn.svm import LinearSVC

sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
accuracy(sk_classifier, test_feats)
```

```python
from nltk.corpus import reuters
len(reuters.categories())
```

For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions

```python
# sudo pip install execnet
import execnet

gw = execnet.makegateway()
channel = gw.remote_exec(remote_tag)   # remote_tag: a module containing the remote code
```

Antlr image


Programmatic import (Python 2.7):

```python
import imp, sys

def import_module(name):
    fp, pathname, description = imp.find_module(name)
    try:
        return imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()
```

```python
from importlib import reload
reload(math)
```
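In Python 3, importlib covers the programmatic import directly (a small sketch, replacing the deprecated imp-based helper above):

```python
import importlib

math = importlib.import_module('math')   # programmatic import by name
```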

python -m timeit 'for x in xrange(50000): b = x**3'

counts = collections.Counter([1,2,3])

state_capitals = collections.defaultdict(str)

Person = namedtuple('Person', ['age', 'height', 'name']) jack = Person(age=30, height=178, name='Jack S.')

combined_dict = collections.ChainMap(dict1, dict2)

d = json.load(f) json.dump(d, f)

json.loads(s) json.dumps(d)

```python
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')  # schema as in the sqlite3 docs example
conn.commit()
conn.close()

# use parameter substitution rather than string interpolation
c.execute("SELECT * FROM table_name WHERE id=?", (cust_id,))
for row in c:
    print(row)
```

a = [1,2,3,4,5] b = list(itertools.combinations(a, 2))

[(1, 2), (1, 3), (1, 4), (1....

list(itertools.dropwhile(is_even, lst))

for i in zip_longest(a, b, fillvalue='Hogwash!'):

itertools.groupby(lst, key=lambda x: x[1])

list(it.accumulate([1,2,3,4,5])) [1, 3, 6, 10, 15]

for i in itertools.repeat('over-and-over', 3):

list(it.accumulate([1,2,3,4,5], func=operator.mul))

it.cycle('ABCD')

```python
import asyncio

async def func():
    # do time-intensive stuff...
    return "Hello, world!"

async def main():
    print(await func())

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
```

To run blocking work in a thread pool from a coroutine:

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()
result = await loop.run_in_executor(executor, func, "Hello,", " world!")  # func here: a regular blocking callable
```

```python
import functools

event = asyncio.Event()
main_future = asyncio.wait([consumer_a(event), consumer_b(event)])

# event loop: trigger the event in 0.1 s, then run until main_future completes
event_loop = asyncio.get_event_loop()
event_loop.call_later(0.1, functools.partial(trigger, event))
done, pending = event_loop.run_until_complete(main_future)
```

```python
import aiohttp

session = aiohttp.ClientSession()
self.websocket = await session.ws_connect("wss://echo.websocket.org")

self.websocket.send_str(message)
result = await self.websocket.receive()
```

```python
import random
from string import punctuation, ascii_letters, digits

random.SystemRandom().choice(ascii_letters + digits + punctuation)
```

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # boundless cache
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
```

base64.b64encode(s, altchars=None)

```python
from Queue import Queue   # `queue` in Python 3

q = Queue()
q.put(1)
q.get()
```

from collections import deque — supports popleft() and appendleft() in addition to append() and pop().

create and advance generator to the first yield

```python
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)          # advance to the first yield
        return cr
    return start
```

example coroutine

```python
@coroutine
def adder(sum=0):
    while True:
        x = yield sum
        sum += x
```
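Usage sketch for the coroutine above: the decorator has already advanced it to the first yield, so send() works immediately.

```python
acc = adder()
print(acc.send(10))   # 10
print(acc.send(5))    # 15
```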

Counter(adict.values()).most_common()

match = re.match(pattern, sentence)
match.groups()

copy.deepcopy(c)

```python
import contextlib

@contextlib.contextmanager
def context_manager(num):
    print('Enter')
    yield num + 1
    print('Exit')

with context_manager(2) as cm:
    pass   # cm == 3 here
```

or implement `__enter__` / `__exit__` on a class (sketch below):
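A minimal sketch of the class-based protocol, equivalent to the contextlib example above:

```python
class ContextManager:
    def __init__(self, num):
        self.num = num

    def __enter__(self):
        print('Enter')
        return self.num + 1

    def __exit__(self, exc_type, exc_value, traceback):
        print('Exit')
        return False   # do not suppress exceptions

with ContextManager(2) as cm:
    print(cm)   # 3
```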

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-g', '--greeting',
                    default='Hello',
                    help='optional alternate greeting')
args = parser.parse_args()
```

docopt uses usage string

sys.setrecursionlimit(limit)

urllib.request.urlopen(...).read(), or .code for the response status

```python
from selenium import webdriver

browser = webdriver.Firefox()   # launch Firefox
browser.get('http://stackoverflow.com/questions?sort=votes')
title = browser.find_element_by_css_selector('h1').text
questions = browser.find_elements_by_css_selector('.question-summary')
```

@BaseClass.foo.setter
kurtzace commented 8 months ago

NLP

Also from GitHub's axel-sirota/tf-dev-nlp

Sentiment

pad zeros manually image

Prepare the embedding matrix; if a word is not in GloVe, count it as a miss (track hits vs. misses).

image

Set the embedding layer's trainable=False; also summarize all tokens into a single review vector with a Lambda layer over axis 1 (image). Manual predict (image).
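A minimal Keras sketch of the setup described above; `embedding_matrix` is assumed to be the GloVe matrix prepared earlier, and averaging over axis 1 is my reading of "summarize all tokens":

```python
import tensorflow as tf

vocab_size, embed_dim = embedding_matrix.shape   # embedding_matrix: GloVe matrix built above (assumed)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                          # frozen GloVe embeddings
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),   # one vector per review
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```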

Intent

Softmax loss types are CategoricalCrossentropy (when y is one-hot) or SparseCategoricalCrossentropy (integer labels), used with one-vs-many (multiclass) classification.
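A small sketch (not from the notes) showing that the two losses compute the same value on equivalent targets: categorical cross-entropy takes one-hot targets, the sparse variant takes integer class ids.

```python
import tensorflow as tf

y_onehot = tf.constant([[0., 1., 0.]])      # one-hot target
y_int = tf.constant([1])                    # same target as an integer class id
logits = tf.constant([[0.1, 2.0, -1.0]])

cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(float(cce(y_onehot, logits)), float(scce(y_int, logits)))   # equal values
```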

In the df, check intent.intent.value_counts() to see which intents are under-represented.

trim intents.groupby('intent').filter(lambda x: len(x)>=15).reset_index()

Since y is text, map each intent to a number using:

y_factorized, level_intent = pd.factorize(y_filtered)

Then one-hot encode y (sketch below).
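A short sketch using the names from these notes (`y_filtered` is assumed to be the filtered intent series):

```python
import pandas as pd
import tensorflow as tf

y_factorized, level_intent = pd.factorize(y_filtered)      # text intents -> integer ids
y_onehot = tf.keras.utils.to_categorical(y_factorized)     # integer ids -> one-hot rows
num_classes = len(level_intent)
```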

Then the model (images).

Text generation

Read up on RNNs, GRUs (reset/update gating over the input and previous state), and LSTMs (GRU-like gating plus a long-term cell state C).

pip install textblob gensim keras-nlp swifter

pandas has sample(frac=).reset_index()

Use swifter to apply a function over pandas rows in parallel (sketch below):

responses.swifter.apply(my_lambda_func)
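A small sketch (hypothetical column and function names) of the swifter call shape, which mirrors plain .apply():

```python
import pandas as pd
import swifter  # noqa: F401  (importing registers the .swifter accessor)

df = pd.DataFrame({'response': ['hello there', 'general kenobi']})
df['n_tokens'] = df['response'].swifter.apply(lambda s: len(s.split()))
```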

make vocab of custom characters

image

char to id and back image

look at https://github.com/axel-sirota/tf-dev-nlp/blob/main/module5/TF_Developer_NLP_Module5_Demo1_Text_Generation_Character.ipynb for steps to pad, preprocess, plus use swifter to get ids from tensor

Using a GRU: return states, build the graph, initialise the RNN state (image).

Perplexity: how random the generated text is (image).

from_logits is True if you don't add a softmax layer. Use sparse categorical cross-entropy (because the character-id targets are not one-hot encoded); see the sketch below.
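A sketch of how the two points combine; the exp-of-mean-cross-entropy formula for perplexity is my addition rather than from the notes:

```python
import tensorflow as tf

# targets are integer character ids; the model outputs raw logits (no softmax layer)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def perplexity(y_true, y_pred_logits):
    return tf.exp(tf.reduce_mean(loss_fn(y_true, y_pred_logits)))
```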

Also see the OneStep model in the TF text-generation docs and the notebook linked above.

References

- Google: "Sequence Models for Time Series and Natural Language Processing on Google Cloud"
- "AI and Machine Learning for Coders: A Programmer's Guide to Artificial Intelligence" by Laurence Moroney
- "Deep Learning" by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

RNN classify

image

CNN for text sequence image

filter pattern matcher


simple classifier after tokenization with leaky relu

(images)

Predict image

Preprocess image

embedding layer image

If GloVe is used (images)

Predict with GloVe embedding (image)

Mapping intents to numbers -> factorize

image

--

Old NLP architectures: encoders (language changes, dimensionality, difficult to detect similarity)

Use pretrained word embeddings

Process in text to usable tensor

RNN and self attention.


character handling image

convert to mini batch image

kurtzace commented 8 months ago

Transfer Learning

(images)

alternate - without data augmentation image

kurtzace commented 8 months ago

AI Language Models and Foundation Models by Doru Catana

1954: Georgetown-IBM experiment, 250 words, Russian-to-English translation. 1980s: statistical approaches, HMMs (hidden Markov models).

2000s: neural networks. 2013: word embeddings, RNN/LSTM. 2017: Transformer architecture. 2018: BERT (bidirectional encoder representations from Transformers). 2018 to 2022: GPT, GPT-2, GPT-3. RoBERTa: BERT improved on a larger dataset. T5: text-to-text transfer Transformer. XLNet: overcomes some BERT limits.

RNN has In, Hidden mem (context), Out.

Transformers have input / attention mechanism / positional encoding / output.

Foundational Model: BERT/GPT

Challenges: Scarcity of data, privacy/copyright, Biases, Scaling, Ethics.

Models:

Diffusion models: generate data similar to what they were trained on; add Gaussian noise and learn to reverse the process. E.g. DALL-E (vision), Stable Diffusion, Midjourney.

Uni- or multimodal. Examples:

- Knowledge: T5
- Translation: BERT
- Reinforcement learning: AlphaGo
- Audio: WaveNet


- Language: GPT is versatile, few-shot; high compute cost, bias risk. Params: 175B for GPT-3 and 1.76 trillion for GPT-4.
- BERT: sentiment and language understanding. Open source. Params: Base 110M, Large 340M.
- T5: fast. Params: 60M to 11B. Open source, TensorFlow.

kurtzace commented 7 months ago

NLP's and Transformer Models by Axel Sirota

As sequences get longer, the BLEU score drops.

Attention brings context/alignment.

(image) Alignment is a function of the hidden state and the output state.

Bahdanau stated (image)

The output is a softmax (image)

Calculate the correct context for each word.

So: many encoder RNN steps feed their hidden states into the context; this context is fed as attention to the decoder steps: rnn -> attn -> rnn -> attn -> rnn (each RNN step also has its own input and output, apart from the attention).


github.com/axel-sirota/nlp-and-transformers/

nltk.download('word2vec_sample')

(images)

similarity
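A hedged sketch based on the NLTK/gensim how-to (the `models/word2vec_sample/pruned.word2vec.txt` path is, as far as I recall, what nltk.download('word2vec_sample') provides):

```python
import gensim
from nltk.data import find

path = str(find('models/word2vec_sample/pruned.word2vec.txt'))
model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=False)

print(model.most_similar('king', topn=3))
print(model.similarity('king', 'queen'))
```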

Query Key Value

(images)

For multi-headed attention (image): dropout and a number of heads (e.g. 3); see the sketch below.
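A Keras sketch of the multi-head attention layer described above; `key_dim=64` and the input shapes are assumptions:

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=3, key_dim=64, dropout=0.1)
x = tf.random.normal((2, 10, 64))        # (batch, seq_len, features)
out = mha(query=x, value=x, key=x)       # self-attention: Q = K = V
```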

which word to choose

translate for shorter seq

image

Transformer

image

Add residual connections and normalize, to help retain information from earlier layers.
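A tiny sketch of that add-and-normalize (residual + layer norm) step, as I understand it:

```python
import tensorflow as tf

def add_and_norm(x, sublayer_out):
    # add the sublayer output back to its input, then normalise
    return tf.keras.layers.LayerNormalization()(x + sublayer_out)
```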

The output of the last encoder is input to all decoder layers.

Masked multi-head attention (to avoid looking into the future).

Smaller, faster (parallelizable), fewer parameters.

Vaswani et al., "Attention Is All You Need", is the paper behind the explanation above.

kurtzace commented 7 months ago

hugging face

```python
from transformers import pipeline

senti_ana = pipeline("sentiment-analysis")
result = senti_ana(mySent)
```

to fine tune

(images)

DistilBERT for sentiment

image

Hugging Face models output logits (image)

image

For summarization:

- XSum dataset
- ROUGE evaluation
- T5 model

Every batch has a different size, so use a data collator (image); a sketch follows.
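A minimal Hugging Face sketch (assuming the T5 tokenizer and model from these notes are already loaded as `tokenizer` and `model`): the seq2seq data collator pads each batch dynamically to its own length.

```python
from transformers import DataCollatorForSeq2Seq

# pads inputs and labels per batch, so batches of different sizes work
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
```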

adam fine tune lr and weights

image

kurtzace commented 1 month ago

articles

Training Data for the Price of a Sandwich https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/#recommendations-for-using-common-crawl-to-train-ai -> Interesting article about the training data used for generative AI.