Also from GitHub's axel-sirota/tf-dev-nlp
Pad sequences with zeros manually.
Prepare the embedding matrix; if a word is not in GloVe it is a miss (track hits vs. misses).
Set the Embedding layer's trainable=False; also summarize all tokens of a review into a single vector with a Lambda layer (reduce over axis=1) and predict manually.
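A minimal sketch of these steps, assuming hypothetical `word_index`, `embeddings_index` (loaded GloVe vectors), `vocab_size` and 100-d vectors (none of these names are from the course):

```python
import numpy as np
import tensorflow as tf

embedding_matrix = np.zeros((vocab_size, 100))
hits, misses = 0, 0
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector   # word found in GloVe: a hit
        hits += 1
    else:
        misses += 1                    # not in GloVe: row stays all zeros
print(f"hits: {hits}, misses: {misses}")

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, 100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                          # keep GloVe weights frozen
    tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),   # one vector per review
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```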
Softmax loss types: CategoricalCrossentropy (when y is one-hot) or SparseCategoricalCrossentropy (when y is integer labels), for multiclass (one label among many classes).
In the DataFrame, check intents.intent.value_counts() to see which intents are under-represented.
Trim rare intents: intents.groupby('intent').filter(lambda x: len(x) >= 15).reset_index()
Since y is text, map each intent to a number using:
y_factorized, level_intent = pd.factorize(y_filtered)
Then one-hot encode y and build the model (see the sketch below).
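A minimal sketch of the factorize / one-hot / model steps (the `intents` DataFrame columns, `vocab_size` and layer sizes are my assumptions):

```python
import pandas as pd
import tensorflow as tf

y_factorized, level_intent = pd.factorize(intents['intent'])   # text labels -> integer codes
y_onehot = tf.keras.utils.to_categorical(y_factorized)         # one-hot targets

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(len(level_intent), activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```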
Read up on RNNs, GRUs (reset/update gates combining the input with the previous state), and LSTMs (roughly a GRU plus a long-term cell state C).
pip install textblob gensim keras-nlp swifter
pandas has sample(frac=...).reset_index() for shuffling.
Use swifter to apply a function to pandas rows in parallel:
responses.swifter.apply(my_lambda_func)
Build a vocabulary of custom characters.
Map characters to IDs and back.
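A minimal character-to-ID sketch using Keras StringLookup layers (the `texts` corpus is a hypothetical placeholder):

```python
import tensorflow as tf

vocab = sorted(set("".join(texts)))                          # custom character vocabulary
char_to_id = tf.keras.layers.StringLookup(vocabulary=vocab, mask_token=None)
id_to_char = tf.keras.layers.StringLookup(
    vocabulary=char_to_id.get_vocabulary(), invert=True, mask_token=None)

ids = char_to_id(tf.strings.unicode_split("hello", "UTF-8"))  # chars -> ids
chars = id_to_char(ids)                                       # ids -> chars
text_back = tf.strings.reduce_join(chars).numpy().decode("utf-8")
```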
look at https://github.com/axel-sirota/tf-dev-nlp/blob/main/module5/TF_Developer_NLP_Module5_Demo1_Text_Generation_Character.ipynb for steps to pad, preprocess, plus use swifter to get ids from tensor
Using a GRU: return states, build the graph, and initialize the RNN state.
Perplexity measures how random/surprising the generated text is (lower is better).
from_logits=True when the model does not end in a softmax.
Use sparse categorical cross-entropy (labels are integer IDs, no one-hot encoding).
Also see the OneStep model from the TF docs and the notebook linked above.
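A short sketch of the loss/perplexity point (tensor names are placeholders):

```python
import tensorflow as tf

# Integer labels and raw logits (no softmax layer), hence from_logits=True.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
mean_loss = loss_fn(y_true_ids, y_pred_logits)   # hypothetical tensors
perplexity = tf.exp(mean_loss)                   # lower perplexity = less "random" text
```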
Further reading: "Sequence Models for Time Series and Natural Language Processing" on Google Cloud; "AI and Machine Learning for Coders: A Programmer's Guide to Artificial Intelligence" by Laurence Moroney; "Deep Learning" by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
CNN for text sequence (see the sketch after this outline)
filter pattern matcher
Predict
Preprocess
embedding layer
if GloVe is used
predict with GloVe embeddings
Mapping intents to numbers -> factorize
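A minimal Conv1D text-classification sketch tying the outline above together (frozen GloVe embeddings, Conv1D filters as n-gram pattern matchers, softmax over the factorized intents); `padded_sequences`, `num_intents` and the hyperparameters are assumptions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, 100,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),                                # GloVe weights, frozen
    tf.keras.layers.Conv1D(128, 5, activation="relu"),   # 5-token pattern matchers
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(num_intents, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
preds = model.predict(padded_sequences)                  # hypothetical padded input
```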
--
Old NLP architecture - encoders (language changes, dimensionality, and difficulty detecting similarity)
Use pretrained word embeddings
Process input text into a usable tensor
RNN and self attention.
character handling
convert to mini batch
alternate - without data augmentation
1954: Georgetown-IBM experiment - 250 words, Russian-to-English translation. 1980s: statistical approaches - HMMs (hidden Markov models).
2000s: neural networks. 2013: word embeddings, RNN/LSTM. 2017: Transformer architecture. 2018: BERT (Bidirectional Encoder Representations from Transformers). 2018 to 2022: GPT, GPT-2, GPT-3. RoBERTa: improved BERT trained on a larger dataset. T5: Text-to-Text Transfer Transformer. XLNet: overcomes some of BERT's limits.
An RNN has input, hidden memory (context), and output.
Transformers have input, an attention mechanism, positional encoding, and output.
Foundational Model: BERT/GPT
Challenges: Scarcity of data, privacy/copyright, Biases, Scaling, Ethics.
Models:
Diffusion models: generate data similar to what they were trained on by adding Gaussian noise and learning to reverse the process. E.g. DALL-E (vision), Stable Diffusion, Midjourney.
Uni- or multimodal.
Knowledge: T5
Translation: BERT
Reinforcement learning: AlphaGo
Audio: WaveNet
Language: GPT - versatile, few-shot.
High compute cost, bias risk.
Parameters: 175B for GPT-3 and a reported 1.76 trillion for GPT-4.
BERT: sentiment and language understanding. Open source. Parameters: Base 110M, Large 340M.
T5: fast. Parameters: 60M to 11B. Open source. TensorFlow.
As sequences get longer, the BLEU score of plain seq2seq models drops.
Attention brings context/alignment.
Alignment is a function of the encoder hidden states and the decoder output state.
Bahdanau et al. proposed this (additive attention).
The alignment scores go through a softmax,
which computes the correct context for each word,
so many encoder RNN hidden states feed into a context vector, and this context is fed as attention to the decoder.
rnn -> attn -> rnn -> attn -> rnn (each RNN has its own input and output upwards, apart from the attention)
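A minimal sketch of Bahdanau-style additive attention (my own illustration of the idea above, not code from the course; shapes are assumptions):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder hidden states
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.V = tf.keras.layers.Dense(1)       # scores each encoder position

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, units); encoder_outputs: (batch, time, units)
        query = tf.expand_dims(decoder_state, 1)
        scores = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))
        weights = tf.nn.softmax(scores, axis=1)                      # alignment weights
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)   # context vector
        return context, weights
```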
github.com/axel-sirota/nlp-and-transformers/
nltk.download('word2vec_sample')
Attention uses Query, Key, and Value.
For multi-head attention, set dropout and the number of heads (e.g. 3).
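A quick sketch with those hyperparameters using Keras' built-in layer (the key_dim and shapes are assumptions):

```python
import tensorflow as tf

mha = tf.keras.layers.MultiHeadAttention(num_heads=3, key_dim=64, dropout=0.1)
x = tf.random.normal((2, 10, 64))        # (batch, sequence, features)
attn_out = mha(query=x, value=x, key=x)  # self-attention: Q, K, V are the same sequence
out = tf.keras.layers.LayerNormalization()(x + attn_out)  # residual + normalize (see below)
```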
which word to choose
translate for shorter seq
Add residual connections and layer normalization to help retain information from earlier layers.
The output of the last encoder layer is input to every decoder layer.
Masked multi-head attention (to avoid looking into the future).
Smaller, faster (parallelizable), fewer parameters.
The above is the architecture from Vaswani et al., "Attention Is All You Need".
from transformers import pipeline
senti_ana = pipeline("sentiment-analysis")
result = senti_ana(mySent)
To fine-tune:
DistilBERT for sentiment.
Hugging Face models output logits.
XSum dataset.
ROUGE evaluation.
T5 model.
Every batch has a different size, so use a data collator.
Fine-tune with Adam (learning rate and weight decay).
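A hedged sketch of that fine-tuning flow with Hugging Face (hyperparameters and t5-small are my assumptions; the API shown needs reasonably recent transformers/datasets versions):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

dataset = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def preprocess(batch):
    inputs = tokenizer(["summarize: " + d for d in batch["document"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True)
collator = DataCollatorForSeq2Seq(tokenizer, model=model)  # pads each variable-sized batch
args = Seq2SeqTrainingArguments("t5-xsum", learning_rate=2e-5, weight_decay=0.01,
                                per_device_train_batch_size=8, num_train_epochs=1)
trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
                         train_dataset=tokenized["train"],
                         eval_dataset=tokenized["validation"])
trainer.train()
# ROUGE can then be computed on generated summaries, e.g. with the `evaluate` library's rouge metric.
```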
Training Data for the Price of a Sandwich https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl/#recommendations-for-using-common-crawl-to-train-ai -> Interesting article about the training data used for generative AI.
ANN
3d plot
Part 2
AI ML Notes
Which model to choose
logit/class in tf
image generator
Sequence data
Building data for any sequential model (ANN/RNN/GRU/LSTM): standardize first, then build the sequences.
simple rnn
In the LSTM equations, the activations f_c and f_h are generally tanh.
If you return all hidden states, try max pooling.
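A minimal sketch of the "return all hidden states, then max pool" idea (hyperparameters are assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(64, return_sequences=True),   # emit every hidden state
    tf.keras.layers.GlobalMaxPooling1D(),               # strongest activation over time
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```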
CNN for text
https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
Noun (N): Daniel, London, table, dog, teacher, pen, city, happiness, hope
Verb (V): go, speak, run, eat, play, live, walk, have, like, are, is
Adjective (ADJ): big, happy, green, young, fun, crazy, three
Adverb (ADV): slowly, quietly, very, always, never, too, well, tomorrow
Preposition (P): at, on, in, from, with, near, between, about, under
Conjunction (CON): and, or, but, because, so, yet, unless, since, if
Pronoun (PRO): I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection (INT): Ouch! Wow! Great! Help! Oh! Hey! Hi
grammar = '''NP: {<DT>?<JJ>*<NN>}'''  # common NP chunk pattern; the original grammar was truncated here
chunkParser = nltk.RegexpParser(grammar)
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
tree = chunkParser.parse(tagged)
for subtree in tree.subtrees():
    print(subtree)
from transforms import filter_insignificant
Text Classification: Binary or multiclass classification
def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)
def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)
naive bayes: P(label | features) = P(label) * P(features | label) / P(features)
from nltk.classify import NaiveBayesClassifier
from featx import bag_of_words
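A small usage sketch assuming hypothetical `train_docs`/`test_docs` lists of (words, label) pairs and the cookbook's bag_of_words helper:

```python
import nltk.classify.util

train_feats = [(bag_of_words(words), label) for words, label in train_docs]
test_feats = [(bag_of_words(words), label) for words, label in test_docs]

nb_classifier = NaiveBayesClassifier.train(train_feats)
print(nb_classifier.classify(bag_of_words(['great', 'movie'])))
print(nltk.classify.util.accuracy(nb_classifier, test_feats))
```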
The third classifier we will cover is the MaxentClassifier class, also known as a conditional exponential classifier or logistic regression classifier. The maximum entropy classifier converts labeled feature sets to vectors using encoding. This encoded vector is then used to calculate weights for each feature that can then be combined to determine the most likely label for a feature set. For more details on the math behind this, see https://en.wikipedia.org/wiki/Maximum_entropy_classifier.
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

class SklearnClassifier(ClassifierI):
    def __init__(self, estimator, dtype=float, sparse=True):
        self._clf = estimator
        self._encoder = LabelEncoder()
        self._vectorizer = DictVectorizer(dtype=dtype, sparse=sparse)

    def batch_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        classes_ = self._encoder.classes_
        return [classes_[i] for i in self._clf.predict(X)]

    def batch_prob_classify(self, featuresets):
        X = self._vectorizer.transform(featuresets)
        y_proba_list = self._clf.predict_proba(X)
        return [self._make_probdist(y_proba) for y_proba in y_proba_list]

    def labels(self):
        return list(self._encoder.classes_)

    def train(self, labeled_featuresets):
        X, y = list(compat.izip(*labeled_featuresets))
        X = self._vectorizer.fit_transform(X)
        y = self._encoder.fit_transform(y)
        self._clf.fit(X, y)
        return self

    def _make_probdist(self, y_proba):
        classes_ = self._encoder.classes_
        return DictionaryProbDist(dict((classes_[i], p) for i, p in enumerate(y_proba)))
from sklearn.svm import SVC
Precision = TP / (TP + FP); Recall = TP / (TP + FN)
Confusion matrix (rows = predicted, columns = actual):
Predicted +: TP (actual +), FP (actual -)
Predicted -: FN (actual +), TN (actual -)
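A tiny worked example with made-up counts:

```python
tp, fp, fn = 40, 10, 5
precision = tp / (tp + fp)   # 40 / 50 = 0.8
recall = tp / (tp + fn)      # 40 / 45 ≈ 0.889
```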
sk_classifier = SklearnClassifier(LinearSVC()).train(train_feats)
For the first case, we'll be using execnet to do parallel and distributed processing with NLTK. For the second case, you'll learn how to use the Redis data structure server/database to store frequency distributions
sudo pip install execnet
gw = execnet.makegateway()
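A minimal execnet round trip (a generic illustration, not the book's exact recipe):

```python
import execnet

gw = execnet.makegateway()
channel = gw.remote_exec("""
    n = channel.receive()     # runs in the remote interpreter
    channel.send(n + 1)
""")
channel.send(41)
print(channel.receive())      # 42
gw.exit()
```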
ANTLR
Programmatic import (Python 2.7):
import imp, sys

def import_module(name):
    fp, pathname, description = imp.find_module(name)
    try:
        return imp.load_module(name, fp, pathname, description)
    finally:
        if fp:
            fp.close()
from importlib import reload
reload(math)
python -m timeit 'for x in xrange(50000): b = x**3'
counts = collections.Counter([1,2,3])
state_capitals = collections.defaultdict(str)
Person = namedtuple('Person', ['age', 'height', 'name'])
jack = Person(age=30, height=178, name='Jack S.')
combined_dict = collections.ChainMap(dict1, dict2)
d = json.load(f)
json.dump(d, f)
json.loads(s)
json.dumps(d)
conn = sqlite3.connect('example.db')
c = conn.cursor()
c.execute('''CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)''')  # columns assumed; the original CREATE statement was truncated
conn.commit()
conn.close()
c.execute("SELECT * FROM table_name WHERE id=?", (cust_id,))  # parameterized query instead of string interpolation
for row in c:
    print(row)
a = [1,2,3,4,5] b = list(itertools.combinations(a, 2))
[(1, 2), (1, 3), (1, 4), (1....
list(itertools.dropwhile(is_even, lst))
for i in zip_longest(a, b, fillvalue='Hogwash!'):
itertools.groupby(lst, key=lambda x: x[1])
list(it.accumulate([1,2,3,4,5])) [1, 3, 6, 10, 15]
for i in itertools.repeat('over-and-over', 3):
list(it.accumulate([1,2,3,4,5], func=operator.mul))
it.cycle('ABCD')
import asyncio

async def main():
    print(await func())

async def func():
    # Do time-intensive stuff...
    return "Hello, world!"

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
executor = ThreadPoolExecutor()
result = await loop.run_in_executor(executor, func, "Hello,", " world!")
event = asyncio.Event()
main_future = asyncio.wait([consumer_a(event), consumer_b(event)])
event loop
event_loop = asyncio.get_event_loop() event_loop.call_later(0.1, functools.partial(trigger, event)) # trigger event in 0.1 sec
complete main_future
done, pending = event_loop.run_until_complete(main_future)
session = aiohttp.ClientSession()
self.websocket = await session.ws_connect("wss://echo.websocket.org")
await self.websocket.send_str(message)
result = await self.websocket.receive()
import random
from string import punctuation, ascii_letters, digits
random.SystemRandom().choice(ascii_letters + digits + punctuation)
from functools import lru_cache

@lru_cache(maxsize=None)  # boundless cache
def fibonacci(n):
    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)
base64.b64encode(s, altchars=None)
from queue import Queue (Python 2: from Queue import Queue); use .put() and .get()
from collections import deque; supports popleft() and appendleft()
Create and advance a generator to the first yield:
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start
example coroutine
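For example (my own illustration, not from the original notes), the classic "grep" coroutine, primed automatically by the decorator above:

```python
@coroutine
def grep(pattern):
    print(f"Looking for {pattern}")
    while True:
        line = (yield)        # already advanced to here by the decorator
        if pattern in line:
            print(line)

g = grep("python")
g.send("no match here")
g.send("python coroutines are neat")   # printed
```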