gpoore / pythontex

A LaTeX package that executes Python and other code in LaTeX documents, and includes the output

Add remote data files as dependencies #37

Open TheChymera opened 10 years ago

TheChymera commented 10 years ago

I already mentioned this in one of our lengthier threads.

I keep my data (for this particular use case) in a directory on a remote server. As I generally have one file per participant, it's not practical to add all the files explicitly as deps, since pythontex would not notice the addition of a new file to the directory.

Has any new feature emerged which would let me specify a folder as a dependency?

Otherwise I was thinking of getting a nightly checksum of my folder and simply adding that file as a dep. Can you see any problems with this (as I understand it, pythontex would end up using checksums of checksums)? What's the best way to calculate the checksum of all the contents of a folder and then save it to that folder (bash-wise)?

gpoore commented 10 years ago

I'm still interested in having add_dependencies() take individual files as well as directories. In the case of directories, that will require walking the tree. In principle, this won't be hard to implement. In practice, I expect users will want to be able to use wildcards, which will mean using fnmatch or something similar.
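
Roughly, that could look something like the sketch below; `collect_dependencies`, the `root` directory, and the `pattern` argument are purely illustrative, not part of the current PythonTeX API:

    import fnmatch
    import os

    def collect_dependencies(root, pattern='*.csv'):
        # Walk the directory tree and return every file matching the wildcard.
        matches = []
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in fnmatch.filter(filenames, pattern):
                matches.append(os.path.join(dirpath, filename))
        return matches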

Anyway, if you want to add local dependencies a folder at a time, I can try to get that in the next release. And you could already use os.listdir() plus a loop involving add_dependencies() to do essentially the same thing. If you want to add remote dependencies, then you will probably have to create your own system for walking the directory, as we discussed in #20. A general solution for remote dependencies seems difficult.
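
For the local case, a minimal sketch of that workaround inside a PythonTeX session might look like this (the directory path is hypothetical; `pytex` is the utilities instance PythonTeX provides):

    import os

    data_dir = 'data/participants/'  # hypothetical local data directory
    for fname in os.listdir(data_dir):
        if fname.endswith('.csv'):
            pytex.add_dependencies(os.path.join(data_dir, fname))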

In terms of using a checksum file as a dependency, that would work, but the possibility of collisions will go up. I would probably go for a file that contains a hash of each of the files in the folder. You could use longer hashes or throw in some CRC checksums to make it more reliable.
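
As a rough sketch (in Python rather than bash, and with placeholder paths), such a manifest could be generated like this and then listed as the single dependency:

    import hashlib
    import os

    data_dir = 'data/participants/'  # hypothetical data directory
    manifest_path = os.path.join(data_dir, 'checksums.txt')  # hypothetical manifest file

    with open(manifest_path, 'w') as manifest:
        for fname in sorted(os.listdir(data_dir)):
            fpath = os.path.join(data_dir, fname)
            # Skip the manifest itself and anything that is not a regular file.
            if fname == 'checksums.txt' or not os.path.isfile(fpath):
                continue
            with open(fpath, 'rb') as data:
                digest = hashlib.sha1(data.read()).hexdigest()
            manifest.write('%s  %s\n' % (digest, fname))

Since PythonTeX hashes the dependency file itself, any change to any data file changes the manifest and so triggers re-execution.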

I don't have any suggestions for calculating a checksum via bash, since I usually do everything in Python instead of bash.

TheChymera commented 10 years ago

I don't expect this to rock your world and save the day for PythonTeX directory deps, but here's how I manage loading my data files (from my local or remote directories, from this file).

    # Snippet from a larger analysis script (Python 2 at the time): `source`,
    # `data_path`, `experiment`, `prepixelation`, `ignore_filename`, and
    # `InputError` are defined elsewhere in the file.
    from os import path
    import pandas as pd

    if source == 'server':
        # Scrape the remote directory index and collect links to .csv files.
        from HTMLParser import HTMLParser
        import urllib
        class ChrParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                if tag == 'a':
                    for key, value in attrs:
                        if key == 'href' and value.endswith('.csv'):
                            pre_fileslist.append(value)
        results_dir = data_path+experiment+'/px'+str(prepixelation)+'/'
        print(results_dir)
        data_url = urllib.urlopen(results_dir).read()
        parser = ChrParser()
        pre_fileslist = []
        parser.feed(data_url) # pre_fileslist gets populated here
    elif source == 'live':
        # Data directory located relative to this script's own path.
        from os import listdir
        results_dir = path.dirname(path.dirname(path.realpath(__file__))) + data_path + str(prepixelation) + '/'
        results_dir = path.expanduser(results_dir)
        pre_fileslist = listdir(results_dir)
    elif source == 'local':
        # Data directory on the local filesystem.
        from os import listdir
        results_dir = data_path + experiment + '/px' + str(prepixelation) + '/'
        results_dir = path.expanduser(results_dir)
        pre_fileslist = listdir(results_dir)

    print('Loading data from '+results_dir)

    if pre_fileslist == []:
        raise InputError('For some reason the list of results files could not be populated.')
    # Keep only .csv files, excluding any file explicitly marked to be ignored.
    files = [lefile for lefile in pre_fileslist if lefile.endswith('.csv') and not lefile.endswith(ignore_filename+'.csv')]

    data_all = pd.DataFrame([]) # empty container frame for concatenating input from multiple files
    for lefile in files:
        data_lefile = pd.DataFrame.from_csv(results_dir+lefile)
        data_all = pd.concat([data_all, data_lefile], ignore_index=True)

Maybe we can integrate the HTML parser approach into PythonTeX?

TheChymera commented 4 years ago

Mirroring my update in https://github.com/gpoore/pythontex/issues/77#issuecomment-678606643, I have decided to avoid runtime networking wonkiness and to distribute my data separately from runtime operations.

So the remote bit, on my part at least, can be binned. Being able to add local directories as deps, however, remains relevant.