pywikibot-catfiles / file-metadata

A python package to analyze files and provide useful metadata
MIT License
15 stars 1 forks source link

Too many dependencies #44

Open AbdealiLoKo opened 8 years ago

AbdealiLoKo commented 8 years ago

This was raised by @zhuyifei1999 on IRC.

file-metadata is very heavy right now. If someone just wants to use video/audio analysis, they need to install scikit-image, and a whole lot of other things that are not necessary. I think the best bet would be to make use of extras_require in setuptools which allows us to add additional package dependencies like file-metadata[audio].

More suggestions are welcome :)

AbdealiLoKo commented 8 years ago

Another possibility mentioned by @drtrigon is that we could have multiple packages. For example, file-metadata-audio, file-metadata-video, etc.

There are 2 methods I can see of doing this:

drtrigon commented 8 years ago

Good point. We anyway may want to split the package a bit. However we have to be aware the fact that some of the libraries might sneak into other parts later as well (e.g. scikit-image).

drtrigon commented 8 years ago

Another 3rd possibility is the same as for the "Plugin mechanism" but without the plugin mechanism just divide the code up into separate packages. file-metadata will use all (actually be a kind of container or meta-package for them) but other projects will may be just use some.

jayvdb commented 8 years ago

There are two problems here:

The latter hurt much more, and are easier to fix. If we fix the former, and video2commons rewrites their backend to use file-metadata, then we can optimise the deps specifically for their needs, and then decide how to package/release/maintain it to include only those deps. It would be great to know what features it needs.

Improving the install time by splitting the dependencies doesnt help the gsoc project very much, as it needs all deps to work properly, and could make poor quality edits if it tries to work without some libraries by downgrading the analysis. That must be avoided at all cost. Optional dependencies is the most beneficial approach, as it allows catimages to run sooner, and provide runtime warnings/errors to users which can be more helpful than errors during setup.py/pip.

An easier way to reduce this installation pain is to improve upstream with wheels and debs, etc. That does also provide speed up to the catimages users installation also.

Improving the latter (runtime startup) does help the gsoc project, but only slightly as typically catimages runs for a long time, and in batch or continuous mode looking at recentchanges, so a minute or two of startup isnt very problematic. However if deps are only loaded when needed, some deps wont be needed for certain batches. E.g. processing only one file type.

zhuyifei1999 commented 8 years ago

It would be great to know what features it needs.

Stream analysis. More specifically FFProbeMixin

If splitting / optional dependencies can't be done I'll just fork that part off or wait till after gsoc :)

zhuyifei1999 commented 8 years ago

Some parts I don't get while reading that FFProbeMixin:

drtrigon commented 8 years ago

If I remember right we need a recent ffmpeg cause old ones miss json output format, @AbdealiJK can you falsify my statement?

Am 21. Juni 2016 15:22:35 MESZ, schrieb zhuyifei1999 notifications@github.com:

Some parts I don't get while reading that FFProbeMixin:

  • why does it seem to prioritize downloaded ffmpeg over system-provided ffmpeg? It's almost like preferring to run arbitrary code from elsewhere
  • for variable executable after which() calls, why is it not using absolute path? eg. it could be executable = executable or which('avprobe') or which('ffprobe') or download_ffmpeg()

You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/AbdealiJK/file-metadata/issues/44#issuecomment-227437588

Dr. Trigon

drtrigon commented 8 years ago

Improving the install time by splitting the dependencies doesnt help the gsoc project very much, as it needs all deps to work properly, and could make poor quality edits if it tries to work without some libraries by downgrading the analysis. That must be avoided at all cost. Optional dependencies is the most beneficial approach, as it allows catimages to run sooner, and provide runtime warnings/errors to users which can be more helpful than errors during setup.py/pip.

Yes, exactly!

Improving the latter (runtime startup) does help the gsoc project, but only slightly as typically catimages runs for a long time, and in batch or continuous mode looking at recentchanges, so a minute or two of startup isnt very problematic. However if deps are only loaded when needed, some deps wont be needed for certain batches. E.g. processing only one file type.

What about having a worker deamon running to avoid start up, you would just have to connect? Would that be possible?

Dr. Trigon

jayvdb commented 8 years ago

A worker daemon is possible, but I dont see the benefit over delayed imports of dependencies, wrt to startup time. It is better that file-metadata does its task in one thread, and the calling program put it into a thread (or many) if it is appropriate for that program. Anyway, if startup time is bad, we need to first see what is the cause with real timing data.

(note I am not requesting any timing data atm; MVP is the priority atm)

drtrigon commented 8 years ago

I wanted to mention a daemon for video2commons not file-metadata of course... ;)

zhuyifei1999 commented 8 years ago

Oh video2commons indeed use worker daemons. Flask on the frontend, celery on the backend. The backend barely suffers any start up delays due to having good redundancy and no NFS dependency, but the frontend is already way too slow, and any code update require a restart (unless in debug mode, which kind of allows arbitrary code execution).

zhuyifei1999 commented 8 years ago

And the frontend loads backend code (and vise versa) because, well, it's celery and they are in the same package.

drtrigon commented 8 years ago

@AbdealiJK: Do we currently profile the code (e.g. list each function calls timing)? Is this included in unittest/coverage? (on travis?)

Am 21. Juni 2016 17:54:39 MESZ, schrieb John Vandenberg notifications@github.com:

A worker daemon is possible, but I dont see the benefit over delayed imports of dependencies, wrt to startup time. It is better that file-metadata does its task in one thread, and the calling program put it into a thread (or many) if it is appropriate for that program. Anyway, if startup time is bad, we need to first see what is the cause with real timing data.


You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/AbdealiJK/file-metadata/issues/44#issuecomment-227485154

Dr. Trigon

AbdealiLoKo commented 8 years ago

No, we do not profile the code right now