Also from GitHub's axel-sirota/tf-dev-nlp
Pad sequences with zeros manually.
Prepare the embedding matrix; if a word is not in GloVe it is a miss (track hits vs. misses).
Set the Embedding layer's trainable=False; also summarize all tokens of a review into a single vector with a Lambda layer (reduce over axis=1) and predict manually.
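A minimal sketch of these steps, assuming hypothetical `word_index`, `embeddings_index` (loaded GloVe vectors), `vocab_size` and 100-d vectors (none of these names are from the course):

```python
import numpy as np
import tensorflow as tf

embedding_matrix = np.zeros((vocab_size, 100))
hits, misses = 0, 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector   # word found in GloVe: a hit
        hits += 1
    else:
        misses += 1                    # not in GloVe: row stays all zeros
print(f"hits: {hits}, misses: {misses}")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, 100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                          # keep GloVe weights frozen
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),   # one vector per review
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```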
Softmax loss types: CategoricalCrossentropy (when y is one-hot) or SparseCategoricalCrossentropy (when y is integer labels), for multiclass (one label among many classes).
In the DataFrame, check intents.intent.value_counts() to see which intents are under-represented.
Trim rare intents: intents.groupby('intent').filter(lambda x: len(x) >= 15).reset_index()
Since y is text, map each intent to a number using:
y_factorized, level_intent = pd.factorize(y_filtered)
Then one-hot encode y and build the model (see the sketch below).
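A minimal sketch of the factorize / one-hot / model steps (the `intents` DataFrame columns, `vocab_size` and layer sizes are my assumptions):

```python
import pandas as pd
import tensorflow as tf

y_factorized, level_intent = pd.factorize(intents['intent'])   # text labels -> integer codes
y_onehot = tf.keras.utils.to_categorical(y_factorized)         # one-hot targets

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(len(level_intent), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```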
Read up on RNNs, GRUs (reset/update gates combining the input with the previous state), and LSTMs (roughly a GRU plus a long-term cell state C).
pip install textblob gensim keras-nlp swifter
pandas has sample(frac=...).reset_index() for shuffling.
Use swifter to apply a function to pandas rows in parallel:
responses.swifter.apply(my_lambda_func)
Build a vocabulary of custom characters.
Map characters to IDs and back.
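A minimal character-to-ID sketch using Keras StringLookup layers (the `texts` corpus is a hypothetical placeholder):

```python
import tensorflow as tf

vocab = sorted(set("".join(texts)))                          # custom character vocabulary
char_to_id = tf.keras.layers.StringLookup(vocabulary=vocab, mask_token=None)
id_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_id.get_vocabulary(), invert=True, mask_token=None)

ids = char_to_id(tf.strings.unicode_split("hello", "UTF-8"))  # chars -> ids
chars = id_to_char(ids)                                       # ids -> chars
text_back = tf.strings.reduce_join(chars).numpy().decode("utf-8")
```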
look at https://github.com/axel-sirota/tf-dev-nlp/blob/main/module5/TF_Developer_NLP_Module5_Demo1_Text_Generation_Character.ipynb for steps to pad, preprocess, plus use swifter to get ids from tensor
Using a GRU: return states, build the graph, and initialize the RNN state.
Perplexity measures how random/surprising the generated text is (lower is better).
from_logits=True when the model does not end in a softmax.
Use sparse categorical cross-entropy (labels are integer IDs, no one-hot encoding).
Also see the OneStep model from the TF docs and the notebook linked above.
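A short sketch of the loss/perplexity point (tensor names are placeholders):

```python
import tensorflow as tf

# Integer labels and raw logits (no softmax layer), hence from_logits=True.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
mean_loss = loss_fn(y_true_ids, y_pred_logits)   # hypothetical tensors
perplexity = tf.exp(mean_loss)                   # lower perplexity = less "random" text
```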
Further reading: "Sequence Models for Time Series and Natural Language Processing" on Google Cloud; "AI and Machine Learning for Coders: A Programmer's Guide to Artificial Intelligence" by Laurence Moroney; "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
CNN for text sequence (see the sketch after this outline)
filter pattern matcher
Predict
Preprocess
embedding layer
if GloVe is used
predict with GloVe embeddings
Mapping intents to numbers -> factorize
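A minimal Conv1D text-classification sketch tying the outline above together (frozen GloVe embeddings, Conv1D filters as n-gram pattern matchers, softmax over the factorized intents); `padded_sequences`, `num_intents` and the hyperparameters are assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, 100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                # GloVe weights, frozen
    tf.keras.layers.Conv1D(128, 5, activation="relu"),   # 5-token pattern matchers
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(num_intents, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
preds = model.predict(padded_sequences)                  # hypothetical padded input
```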
--
Old NLP architecture - encoders (language changes, dimensionality, and difficulty detecting similarity)
Use pretrained word embeddings
Process input text into a usable tensor
RNN and self attention.
character handling
convert to mini batch
alternate - without data augmentation
1954: Georgetown-IBM experiment - 250 words, Russian-to-English translation. 1980s: statistical approaches - HMMs (hidden Markov models).
2000s: neural networks. 2013: word embeddings, RNN/LSTM. 2017: Transformer architecture. 2018: BERT (Bidirectional Encoder Representations from Transformers). 2018 to 2022: GPT, GPT-2, GPT-3. RoBERTa: improved BERT trained on a larger dataset. T5: Text-to-Text Transfer Transformer. XLNet: overcomes some of BERT's limits.
An RNN has input, hidden memory (context), and output.
Transformers have input, an attention mechanism, positional encoding, and output.
Foundational Model: BERT/GPT
Challenges: Scarcity of data, privacy/copyright, Biases, Scaling, Ethics.
Models:
Diffusion models: generate data similar to what they were trained on by adding Gaussian noise and learning to reverse the process. E.g. DALL-E (vision), Stable Diffusion, Midjourney.
Uni- or multimodal.
Knowledge: T5
Translation: BERT
Reinforcement learning: AlphaGo
Audio: WaveNet
Language: GPT - versatile, few-shot.
High compute cost, bias risk.
Parameters: 175B for GPT-3 and a reported 1.76 trillion for GPT-4.
BERT: sentiment and language understanding. Open source. Parameters: Base 110M, Large 340M.
T5: fast. Parameters: 60M to 11B. Open source. TensorFlow.
As sequences get longer, the BLEU score of plain seq2seq models drops.
Attention brings context/alignment.
Alignment is a function of the encoder hidden states and the decoder output state.
Bahdanau et al. proposed this (additive attention).
The alignment scores go through a softmax,
which computes the correct context for each word,
so many encoder RNN hidden states feed into a context vector, and this context is fed as attention to the decoder.
rnn -> attn -> rnn -> attn -> rnn (each RNN has its own input and output upwards, apart from the attention)
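A minimal sketch of Bahdanau-style additive attention (my own illustration of the idea above, not code from the course; shapes are assumptions):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder hidden states
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.V = tf.keras.layers.Dense(1)       # scores each encoder position

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, units); encoder_outputs: (batch, time, units)
        query = tf.expand_dims(decoder_state, 1)
        scores = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        weights = tf.nn.softmax(scores, axis=1)                      # alignment weights
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)   # context vector
        return context, weights
```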
github.com/axel-sirota/nlp-and-transformers/
nltk.download('word2vec_sample')
Attention uses Query, Key, and Value.
For multi-head attention, set dropout and the number of heads (e.g. 3).
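A quick sketch with those hyperparameters using Keras' built-in layer (the key_dim and shapes are assumptions):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=3, key_dim=64, dropout=0.1)
x = tf.random.normal((2, 10, 64))        # (batch, sequence, features)
attn_out = mha(query=x, value=x, key=x)  # self-attention: Q, K, V are the same sequence
out = tf.keras.layers.LayerNormalization()(x + attn_out)  # residual + normalize (see below)
```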
which word to choose
translate for shorter seq
Add residual connections and layer normalization to help retain information from earlier layers.
The output of the last encoder layer is input to every decoder layer.
Masked multi-head attention (to avoid looking into the future).
Smaller, faster (parallelizable), fewer parameters.
The above is the architecture from Vaswani et al., "Attention Is All You Need".
from transformers import pipeline
senti_ana = pipeline("sentiment-analysis")
result = senti_ana(mySent)
To fine-tune:
DistilBERT for sentiment.
Hugging Face models output logits.
XSum dataset.
ROUGE evaluation.
T5 model.
Every batch has a different size, so use a data collator.
Fine-tune with Adam (learning rate and weight decay).
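A hedged sketch of that fine-tuning flow with Hugging Face (hyperparameters and t5-small are my assumptions; the API shown needs reasonably recent transformers/datasets versions):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

dataset = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    inputs = tokenizer(["summarize: " + d for d in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)
collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads each variable-sized batch
args = Seq2SeqTrainingArguments("t5-xsum", learning_rate=2e-5, weight_decay=0.01,
                                per_device_train_batch_size=8, num_train_epochs=1)
trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["validation"])
trainer.train()
# ROUGE can then be computed on generated summaries, e.g. with the `evaluate` library's rouge metric.
```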
Training Data for the Price of a Sandwich https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/#recommendations-for-using-common-crawl-to-train-ai -> Interesting article about the training data used for generative AI.
ANN
3d plot
Part 2
AI ML Notes
Which model to choose
logit/class in tf
image generator
Sequence data
Building data for any sequential model (ANN/RNN/GRU/LSTM): standardize first, then build the sequences.
simple rnn
In the LSTM equations, the activations f_c and f_h are generally tanh.
If you return all hidden states, try max pooling.
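A minimal sketch of the "return all hidden states, then max pool" idea (hyperparameters are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(64, return_sequences=True),   # emit every hidden state
    tf.keras.layers.GlobalMaxPooling1D(),               # strongest activation over time
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```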
CNN for text
https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
Noun (N): Daniel, London, table, dog, teacher, pen, city, happiness, hope
Verb (V): go, speak, run, eat, play, live, walk, have, like, are, is
Adjective (ADJ): big, happy, green, young, fun, crazy, three
Adverb (ADV): slowly, quietly, very, always, never, too, well, tomorrow
Preposition (P): at, on, in, from, with, near, between, about, under
Conjunction (CON): and, or, but, because, so, yet, unless, since, if
Pronoun (PRO): I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection (INT): Ouch! Wow! Great! Help! Oh! Hey! Hi
grammar = '''NP: {<DT>?<JJ>*<NN>}'''  # common NP chunk pattern; the original grammar was truncated here
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)
from transforms import filter_insignificant
Text Classification: Binary or multiclass classification
def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)
def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)
naive bayes: P(label | features) = P(label) * P(features | label) / P(features)
from nltk.classify import NaiveBayesClassifier
from featx import bag_of_words
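A small usage sketch assuming hypothetical `train_docs`/`test_docs` lists of (words, label) pairs and the cookbook's bag_of_words helper:

```python
import nltk.classify.util

train_feats = [(bag_of_words(words), label) for words, label in train_docs]
test_feats = [(bag_of_words(words), label) for words, label in test_docs]

nb_classifier = NaiveBayesClassifier.train(train_feats)
print(nb_classifier.classify(bag_of_words(['great', 'movie'])))
print(nltk.classify.util.accuracy(nb_classifier, test_feats))
```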
The third classifier we will cover is the MaxentClassifier class, also known as a conditional exponential classifier or logistic regression classifier. The maximum entropy classifier converts labeled feature sets to vectors using encoding. This encoded vector is then used to calculate weights for each feature that can then be combined to determine the most likely label for a feature set. For more details on the math behind this, see https://en.wikipedia.org/wiki/Maximum_entropy_classifier.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

class SklearnClassifier(ClassifierI):
    def __init__(self, estimator, dtype=float, sparse=True):
        self._clf = estimator
        self._encoder = LabelEncoder()
        self._vectorizer = DictVectorizer(dtype=dtype, sparse=sparse)

    def batch_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        classes_ = self._encoder.classes_
        return [classes_[i] for i in self._clf.predict(X)]

    def batch_prob_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        y_proba_list = self._clf.predict_proba(X)
        return [self._make_probdist(y_proba) for y_proba in y_proba_list]

    def labels(self):
        return list(self._encoder.classes_)

    def train(self, labeled_featuresets):
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        self._clf.fit(X, y)
        return self

    def _make_probdist(self, y_proba):
        classes_ = self._encoder.classes_
        return DictionaryProbDist(dict((classes_[i], p) for i, p in enumerate(y_proba)))
from sklearn.svm import SVC
Precision = TP / (TP + FP); Recall = TP / (TP + FN)
Confusion matrix (rows = predicted, columns = actual):
Predicted +: TP (actual +), FP (actual -)
Predicted -: FN (actual +), TN (actual -)
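A tiny worked example with made-up counts:

```python
tp, fp, fn = 40, 10, 5
precision = tp / (tp + fp)   # 40 / 50 = 0.8
recall = tp / (tp + fn)      # 40 / 45 ≈ 0.889
```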
sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions
sudo pip install execnet
gw = execnet.makegateway()
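A minimal execnet round trip (a generic illustration, not the book's exact recipe):

```python
import execnet

gw = execnet.makegateway()
channel = gw.remote_exec("""
    n = channel.receive()     # runs in the remote interpreter
    channel.send(n + 1)
""")
channel.send(41)
print(channel.receive())      # 42
gw.exit()
```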
ANTLR
Programmatic import (Python 2.7):
import imp, sys

def import_module(name):
    fp, pathname, description = imp.find_module(name)
    try:
        return imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()
from importlib import reload
reload(math)
python -m timeit 'for x in xrange(50000): b = x**3'
counts = collections.Counter([1,2,3])
state_capitals = collections.defaultdict(str)
Person = namedtuple('Person', ['age', 'height', 'name'])
jack = Person(age=30, height=178, name='Jack S.')
combined_dict = collections.ChainMap(dict1, dict2)
d = json.load(f)
json.dump(d, f)
json.loads(s)
json.dumps(d)
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)''')  # columns assumed; the original CREATE statement was truncated
conn.commit()
conn.close()
c.execute("SELECT * FROM table_name WHERE id=?", (cust_id,))  # parameterized query instead of string interpolation
for row in c:
    print(row)
a = [1,2,3,4,5] b = list(itertools.combinations(a, 2))
[(1, 2), (1, 3), (1, 4), (1....
list(itertools.dropwhile(is_even, lst))
for i in zip_longest(a, b, fillvalue='Hogwash!'):
itertools.groupby(lst, key=lambda x: x[1])
list(it.accumulate([1,2,3,4,5])) [1, 3, 6, 10, 15]
for i in itertools.repeat('over-and-over', 3):
list(it.accumulate([1,2,3,4,5], func=operator.mul))
it.cycle('ABCD')
import asyncio

async def main():
    print(await func())

async def func():
    # Do time-intensive stuff...
    return "Hello, world!"

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
executor = ThreadPoolExecutor()
result = await loop.run_in_executor(executor, func, "Hello,", " world!")
event = asyncio.Event()
main_future = asyncio.wait([consumer_a(event), consumer_b(event)])
event loop
event_loop = asyncio.get_event_loop() event_loop.call_later(0.1, functools.partial(trigger, event)) # trigger event in 0.1 sec
complete main_future
done, pending = event_loop.run_until_complete(main_future)
session = aiohttp.ClientSession()
self.websocket = await session.ws_connect("wss://echo.websocket.org")
await self.websocket.send_str(message)
result = await self.websocket.receive()
import random
from string import punctuation, ascii_letters, digits
random.SystemRandom().choice(ascii_letters + digits + punctuation)
from functools import lru_cache

@lru_cache(maxsize=None)  # boundless cache
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
base64.b64encode(s, altchars=None)
from queue import Queue (Python 2: from Queue import Queue); use .put() and .get()
from collections import deque; supports popleft() and appendleft()
Create and advance a generator to the first yield:
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start
example coroutine
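For example (my own illustration, not from the original notes), the classic "grep" coroutine, primed automatically by the decorator above:

```python
@coroutine
def grep(pattern):
    print(f"Looking for {pattern}")
    while True:
        line = (yield)        # already advanced to here by the decorator
        if pattern in line:
            print(line)

g = grep("python")
g.send("no match here")
g.send("python coroutines are neat")   # printed
```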