if you lack uniqueness constraints on your datastores then duplicates will occur
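A minimal sketch of the point, using stdlib sqlite3 (table and column names are illustrative): without a UNIQUE constraint the second insert silently succeeds; with one, the datastore rejects the duplicate.

```python
import sqlite3

# Hypothetical table: a UNIQUE constraint makes the datastore itself
# refuse duplicate rows instead of letting them accumulate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT UNIQUE)")
conn.execute("INSERT INTO companies VALUES ('Acme Ltd')")
try:
    conn.execute("INSERT INTO companies VALUES ('Acme Ltd')")
except sqlite3.IntegrityError:
    print("duplicate rejected")
```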
how to create setup.py
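A minimal setup.py sketch; the project name, version and dependency list are placeholders, not a recommendation.

```python
# setup.py - minimal packaging sketch; "myproject" and the dependency
# list are hypothetical, substitute your own.
from setuptools import setup, find_packages

setup(
    name="myproject",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pandas"],  # runtime dependencies
)
```

Run `python setup.py sdist` to build a source distribution, or `pip install -e .` for a development install.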
Hypothesis can fuzz a MySQL round-trip to make sure the data going in and coming back out is the same
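A hand-rolled sketch of the idea, with sqlite3 standing in for MySQL and a seeded random generator standing in for Hypothesis (which would generate far nastier example strings, including awkward unicode):

```python
import random
import sqlite3
import string

# Round-trip fuzzing sketch: write a generated string, read it back,
# assert nothing changed. sqlite3 is a stand-in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (value TEXT)")

rng = random.Random(0)
alphabet = string.printable + "Éüяα"  # include some non-ASCII on purpose
for _ in range(100):
    s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
    conn.execute("DELETE FROM t")
    conn.execute("INSERT INTO t VALUES (?)", (s,))
    (out,) = conn.execute("SELECT value FROM t").fetchone()
    assert out == s, f"round-trip changed {s!r} to {out!r}"
print("100 round-trips preserved")
```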
assume during data ingestion that you'll have duplicated/redundant rows - how to spot and remove them?
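For exact duplicates, pandas handles spotting and removal directly (the rows here are hypothetical); near-duplicates need the fuzzy-matching approaches noted below.

```python
import pandas as pd

# Two identical "Acme Ltd" rows ingested twice (illustrative data).
df = pd.DataFrame({
    "name": ["Acme Ltd", "Acme Ltd", "Beta Inc"],
    "city": ["London", "London", "Paris"],
})
print(df.duplicated().sum())   # marks the second identical row -> 1
clean = df.drop_duplicates()
print(len(clean))              # -> 2
```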
starting point for data ingestion - treat it as a sequence of processes that build on each other, not a single process that does every step at once; this way you can swap stages in and out, test them in isolation and scale out to more machines
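A toy sketch of that shape (stage names are illustrative): each stage is a plain function, so any one can be unit-tested or replaced without touching the rest.

```python
# Ingestion as a sequence of small, independently testable steps
# rather than one monolith. Each step takes rows and returns rows.
def strip_whitespace(rows):
    return [r.strip() for r in rows]

def drop_empty(rows):
    return [r for r in rows if r]

def dedupe(rows):
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def run_pipeline(rows, steps):
    for step in steps:
        rows = step(rows)
    return rows

result = run_pipeline(["  Acme ", "", "Acme", "Beta"],
                      [strip_whitespace, drop_empty, dedupe])
print(result)  # -> ['Acme', 'Beta']
```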
list some text similarity metrics: fuzzywuzzy, Levenshtein distance; note the choice between character- or word-based similarity and character n-gram similarity; removing punctuation/case/unicode accents may be useful first
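A stdlib sketch of two of these ideas, with a normalisation pass first; fuzzywuzzy/python-Levenshtein give better-tuned scores, difflib is just the stand-in here:

```python
import re
import unicodedata
from difflib import SequenceMatcher

def normalise(s):
    # strip accents, lowercase, drop punctuation, collapse whitespace
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
    s = re.sub(r"[^\w\s]", "", s.lower())
    return " ".join(s.split())

def char_similarity(a, b):
    # character-based ratio in [0, 1]
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def ngram_jaccard(a, b, n=3):
    # character n-gram overlap (Jaccard) in [0, 1]
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    ga, gb = grams(normalise(a)), grams(normalise(b))
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(char_similarity("Électricité de France", "electricite de france"))  # -> 1.0
```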
pandas read_csv defaults to dayfirst=False - consider dayfirst=True for poorly specified European dates
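A quick illustration of why the flag matters: "03/04/2015" is ambiguous, and the default reads it month-first.

```python
import io
import pandas as pd

csv = io.StringIO("when\n03/04/2015\n")
us = pd.read_csv(csv, parse_dates=["when"])                 # dayfirst=False
csv.seek(0)
eu = pd.read_csv(csv, parse_dates=["when"], dayfirst=True)  # European reading
print(us["when"][0])  # 4th of March
print(eu["when"][0])  # 3rd of April
```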
more clean data (probably) beats smarter algorithms
clustering for EDA
t-sne in sklearn, visualisations https://lvdmaaten.github.io/tsne/ to help understand what to expect (stuff close in n-dimensions should be close in 2d)
if during cleaning you have to deal with internationalised text (e.g. Russian "Альфа-Банк"), be aware that without tests a naive bit of processing (e.g. lowercasing plus some cleaning rules in C#) might give you "?????-????", which you then blindly store in the database - this is a danger for mixed-programming-language transformations (C#/.NET's rules vs Python's rules) where the two do different things
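The failure is easy to reproduce in Python: the name survives any unicode-aware step, but force it through an ASCII-only encoding with replacement and every Cyrillic letter becomes a question mark.

```python
# Unicode-aware processing is fine...
name = "Альфа-Банк"
print(name.lower())  # -> "альфа-банк"

# ...but an ASCII-only step with errors="replace" silently destroys it.
mangled = name.encode("ascii", errors="replace").decode("ascii")
print(mangled)       # -> "?????-????"
```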
example of text that gets badly encoded twice: "Électricité de France"
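A sketch of how that mojibake arises: UTF-8 bytes mistakenly decoded as Latin-1 turn "É" into two junk characters, and doing it twice garbles further; reversing the same steps can sometimes repair the text.

```python
original = "Électricité de France"

# Mis-decode the UTF-8 bytes as Latin-1 - once, then twice.
once = original.encode("utf-8").decode("latin-1")
twice = once.encode("utf-8").decode("latin-1")
print(once)   # "É" has become "Ã" + a junk character

# Undo both mis-decodings in reverse to recover the original.
repaired = (twice.encode("latin-1").decode("utf-8")
                 .encode("latin-1").decode("utf-8"))
assert repaired == original
```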
list project-types that might work and why; @springcoil's talks cover the requirement to invest in tooling to deliver working systems
r&d != engineering
how might r&d (e.g. 1 person) interface with an eng team?
which bits of an agile process seem to work well? do sprints work well (depends on the task-type)?
who 'owns' the data/process, and can that ownership cause problems?
does the lack of a shared language hinder things?
data scientists need clean data but the system will probably always contain some dirty data, so there is a need for a data-cleaning process (a data engineering team?) who try to improve data quality towards an agreed schema and who can export/transform the data so the r&d team can use it
building mini-monolithic blocks is normal; remember to break them up into smaller services that can be tested, else critical testing is easily avoided (costing development speed later)
add logging early for anything production-like
luigi for task pipelines to avoid manual steps
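A few lines sketching luigi's model (this is not the luigi API, just the idea in stdlib Python): each task names its output file and its dependencies, and only runs when the output does not yet exist, so re-running the pipeline skips completed steps instead of repeating manual work.

```python
import os
import tempfile

def run_task(name, output, requires, action):
    # run dependencies first, then this task - but only if its output is missing
    for dep in requires:
        run_task(*dep)
    if not os.path.exists(output):
        action(output)
        print(f"ran {name}")
    else:
        print(f"skipped {name}")

workdir = tempfile.mkdtemp()
raw = os.path.join(workdir, "raw.txt")
clean = os.path.join(workdir, "clean.txt")

fetch = ("fetch", raw, [],
         lambda p: open(p, "w").write("Acme\nAcme\n"))
dedupe = ("dedupe", clean, [fetch],
          lambda p: open(p, "w").write(
              "\n".join(dict.fromkeys(open(raw).read().split()))))

run_task(*dedupe)  # runs fetch then dedupe
run_task(*dedupe)  # second run: both steps are skipped
```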
getting hired:
what you need to show if you want to get hired (github, talks)
minimal stuff you should do to be more visible
list of tools I'd like to see
auto-possible-euro-datetime-checker (icy.py?) for pandas when reading ambiguous datetimes
string->unit converter (e.g. for relative times like "7 minutes" and weights and measures e.g. "23cm", "1inch", "1 in.", "2000m", "2kilometres", "1 pound", "23oz.", "0.25kg")
learning strategies
clustering for EDA
cleaning
process
getting hired:
list of tools I'd like to see
further reading
pipeline building
tools on my radar
review: