Startup Data collection - HELP needed

minh-vo commented 5 years ago

PROJECT CONTEXT: A research partner has agreed to let me collect monthly data on 150 companies in their investment portfolio over 18-24 months, for the following categories.

Product development (what new feature? What can it do?)
Social Capital (who did you get help from? How did you get help? What were the results?)
Human Capital (who did you hire? How did you get them? What for?) [and so on for another 5 categories]

In each category, the company will describe:

what objective facts occurred in this category in the last month?
what needs to get done in this category in the next month?

I will hand-code these responses and input them into my algorithm.

I will also have initial 'endowment' data on the companies (e.g., initial founding team, business model elements, etc.)

I NEED FEEDBACK ON THE FOLLOWING: I think that an ML algorithm can do these analyses for me...

Anomaly detection (detect risky, deviant changes in the startup)
Clustering, e.g., association rule learning, sequence classification (to abstract behaviors and strategies from categorical/temporal combinations of events)
Prediction of rare events (e.g., follow-on funding, major customer acquisition, innovation)
Clustering of sequences/path dependence based on initial conditions (i.e., business model, team composition, etc.)

...and the 'physics' (structure) of the data is the following...

Temporal event sequence: a startup's life-cycle events are path dependent on past events
Tied events: multiple events happen in the same interval
Hierarchical data: events are in nested categories
Multiple sequences: multiple variables change state over time
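
To make the four properties above concrete, here is a minimal sketch of one way the hand-coded responses could be stored. The `Event` fields, the category and code names, and the company names are all hypothetical placeholders, not part of the actual data collection design.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    company: str   # which startup reported it
    month: int     # reporting interval (1 = first month of the study)
    category: str  # top-level category, e.g. "human_capital"
    code: str      # hand-coded event nested under that category


# Hypothetical hand-coded events (illustrative only, not real data).
events = [
    Event("acme", 1, "product", "new_feature"),
    Event("acme", 2, "human_capital", "hire_engineer"),
    Event("acme", 2, "social_capital", "advisor_intro"),
    Event("beta", 1, "human_capital", "cofounder_exit"),
]


def build_timeline(events):
    """Group events as company -> month -> list of (category, code).

    Events that share a company and month are kept together, which is
    how tied events are represented in this sketch.
    """
    timeline = defaultdict(lambda: defaultdict(list))
    ordered = sorted(events, key=lambda e: (e.company, e.month, e.category, e.code))
    for e in ordered:
        timeline[e.company][e.month].append((e.category, e.code))
    return timeline


def category_sequence(events, company, category):
    """One of the multiple sequences: a single category's codes over time."""
    picked = [e for e in events if e.company == company and e.category == category]
    return [(e.month, e.code) for e in sorted(picked, key=lambda e: e.month)]


timeline = build_timeline(events)
```

Keeping the category on every event preserves the hierarchy, and `category_sequence` pulls out one per-category sequence at a time when a method expects a single ordered series.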

...and therefore I think the algorithms developed in medical science journals, which predict disease events from a patient's past (sparse, irregular, incomplete) medical history, are appropriate and relevant

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5148810/ : in this, "Each medical event is converted into a numerical vector that resembles its “semantics,” via which the similarity between medical events can be easily measured."
[I am stuck here, I am still reading up in this area]
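
As a rough illustration of the quoted idea (events as vectors whose similarity can be measured), here is a minimal sketch. It is not the paper's method: it swaps in plain co-occurrence counts as a stand-in for learned vectors, and the event codes and company-month groupings are hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical hand-coded events, one set per company-month (not real data).
baskets = [
    {"hire_engineer", "new_feature"},
    {"hire_engineer", "new_feature", "advisor_intro"},
    {"cofounder_exit", "flat_revenue"},
    {"cofounder_exit", "flat_revenue", "advisor_intro"},
]

vocab = sorted(set().union(*baskets))

# Count how often each pair of events lands in the same company-month.
# The diagonal (a, a) counts how many company-months contain event a.
cooc = Counter()
for basket in baskets:
    for a in basket:
        for b in basket:
            cooc[(a, b)] += 1


def vector(event):
    """Represent an event by its co-occurrence counts with every event."""
    return [cooc[(event, other)] for other in vocab]


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


sim_growth = cosine(vector("hire_engineer"), vector("new_feature"))
sim_mixed = cosine(vector("hire_engineer"), vector("cofounder_exit"))
```

In this toy data, `sim_growth` comes out higher than `sim_mixed`, because hiring and feature releases tend to appear in the same months while co-founder exits do not. With only 150 companies the counts would be sparse, so any such vectors should be treated as exploratory.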

Am I on the right path? What would you have done differently? What am I not thinking about? Thanks for your help!

gricer01 commented 5 years ago

Hey Minh, what's the research question or objective of the analysis you've described? If I understand correctly, you will have high-dimensional data for 150 units, and you want to cluster them into groups with similar company histories/initial conditions, or make inferences about particular events in their company histories using the rest of the sample. I'm not sure how feasible this will be with a small sample and many dimensions, but maybe matrix completion methods are also worth looking into.
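
For readers unfamiliar with the term, here is a minimal sketch of what matrix completion could look like: fill in missing entries of a companies-by-indicators table by assuming the table is approximately low rank. This is one simple variant (a rank-1 fit by alternating least squares), chosen for brevity; the function name and the toy numbers are hypothetical, and real data would likely need a higher rank and regularization.

```python
def complete_rank1(matrix, n_iters=500):
    """Fill None entries using a rank-1 fit: matrix[i][j] ~ u[i] * v[j].

    Alternates between solving for u with v fixed and for v with u fixed,
    using only the observed (non-None) entries in each least-squares step.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if matrix[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            if den > 0:
                u[i] = sum(matrix[i][j] * v[j] for j in obs) / den
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if matrix[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            if den > 0:
                v[j] = sum(matrix[i][j] * u[i] for i in obs) / den
    # Keep observed values as they are; predict only the missing ones.
    return [
        [matrix[i][j] if matrix[i][j] is not None else u[i] * v[j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Toy table: rows are companies, columns are indicators, None = not reported.
# The hidden pattern is row scale (1, 2, 3) times column profile (1, 0.5, 2, 1).
observed = [
    [1.0, 0.5, None, 1.0],
    [None, 1.0, 4.0, 2.0],
    [3.0, 1.5, 6.0, None],
]

filled = complete_rank1(observed)
```

On this noiseless toy table the missing cells are recovered closely because the data really are rank 1. With 150 companies and many sparse event dimensions, the low-rank assumption itself is the thing to check first.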

minh-vo commented 5 years ago

Hi Richard,

Thanks for your idea.

You're correct: the sample size is sadly small. While I am interested in creating an algorithm that could predict failure and success in a more generalizable sample, the current sample may only produce ~150 outcomes, with an expected failure rate of 85% (the normal startup failure rate). With only 15% successes, it would still be hard to work out which event histories (prior to startup success, of course) are predictive of success. Still wrapping my head around this. One way out is possibly to define multiple 'risky events/actions' (e.g., co-founder turnover, a long period of flat revenue, etc.), so that one startup can have many outcomes (rather than a single one: dead or scaled-up).

How did the matrix completion method come to mind, just curious?
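
To put numbers on this, here is a minimal sketch of the success count implied by the figures above, plus one way a 'risky event' such as flat revenue could be flagged month by month. The revenue series, the 2% tolerance, the 3-month run length, and the function name are all hypothetical choices for illustration.

```python
n_companies = 150
failure_rate = 0.85

# Expected number of successes in the sample: about 22 or 23 companies.
expected_successes = round(n_companies * (1 - failure_rate), 1)


def flat_revenue_months(revenue, min_run=3, tol=0.02):
    """Return the month indices that complete a run of flat revenue.

    A month counts as flat when revenue changed by less than `tol`
    (relative to the previous month). A month is flagged each time the
    current flat streak reaches a multiple of `min_run` months.
    """
    flagged = []
    run = 0
    for t in range(1, len(revenue)):
        prev, cur = revenue[t - 1], revenue[t]
        is_flat = prev != 0 and abs(cur - prev) / abs(prev) < tol
        run = run + 1 if is_flat else 0
        if run > 0 and run % min_run == 0:
            flagged.append(t)
    return flagged


# Hypothetical monthly revenue for one startup (illustrative only).
revenue = [10, 10, 10.1, 10, 15, 15.1, 15, 15.05]
risky_months = flat_revenue_months(revenue)
```

Each flagged month is a separate outcome, so one startup can contribute several risk events over 18-24 months. Those events are correlated within a company, though, so they do not amount to independent observations.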
