vc1492a / tidd

An approach for detecting tsunamis by applying anomaly detection to sTEC d/dt data from orbiting GPS satellites.

Ask Dr. Bortnik about using UCLA resources for computing and get environment specifications #40

Closed vc1492a closed 4 years ago

hamlinliu17 commented 4 years ago

@vc1492a I will most likely be requesting computing resources from the hoffman2 cluster at UCLA. Here is their website, which should list the environments that are available.

vc1492a commented 4 years ago

Thanks for sharing @hamlinliu17! The environment I use is a PowerPC machine with NVIDIA Tesla P100 GPUs - this is what I have available to me. At this stage, it's not clear how powerful a GPU we will need, but the P100 has 3584 cores with 16GB memory, as a point of reference. This is somewhat comparable to the Tesla P4 on the hoffman2 cluster.

I think we will be able to construct a list of specifications such that the same top-level library (FastAI) can be used while allowing different versions of PyTorch and CUDA to support the different computing environments.

Since it seems you will need to request time on the cluster, we ought to reserve the UCLA computing resources for a specific set of experiments and develop locally first. We will also need to make sure we can train models easily without a GPU, so that we can work in a sort of sandbox before submitting the work to a GPU cluster for processing.
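To support that sandbox workflow, a small helper like the sketch below could pick a device so the same notebook code runs with or without a GPU. The helper name and structure here are illustrative, not something already in the repo:

```python
import importlib

def pick_device() -> str:
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # No PyTorch available at all -- fall back to CPU-only tooling.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(device)  # "cuda" on the cluster, "cpu" on a laptop without a GPU
```

Code written against a device string like this should behave identically locally and on hoffman2, just slower on CPU.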

hamlinliu17 commented 4 years ago

@vc1492a I have received approval to access UCLA computing. Will try to run the model code on this branch this week.

vc1492a commented 4 years ago

Sounds good! There isn't much there yet but that branch will at least get you started on testing your environment.

hamlinliu17 commented 4 years ago

@vc1492a I was able to get my environment up and running, and I can run some of the code from Baseline Model Experiment.ipynb. I was going to close this issue and move it on the board, but I do get some issues with disk quota, so I will look into those and keep the issue open for now.

vc1492a commented 4 years ago

That's a great update @hamlinliu17! We can work on that. If you have a lot of RAM (memory) available outside of the GPU, we can host the files on AWS S3 and then pull them into memory during experimentation instead of using any local disk. This will incur some data egress costs, though, so it should be limited where possible. Let me know if you can get the local disk quota resolved, or if you can identify what that quota is (in MB, GB, etc.).
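As a rough sketch of the S3-to-memory idea: read the whole object into RAM and never touch local disk. Here `io.BytesIO` stands in for the streaming body that boto3's `get_object` would return; the bucket/key names are placeholders:

```python
import io

def load_into_memory(stream) -> bytes:
    """Read an entire file-like object into RAM instead of writing it to disk."""
    return stream.read()

# In practice the stream would come from boto3, e.g.:
#   body = s3.get_object(Bucket="...", Key="...")["Body"]
# Here we simulate it with an in-memory buffer.
fake_s3_body = io.BytesIO(b"sTEC d/dt observations")
data = load_into_memory(fake_s3_body)
print(len(data))  # 22
```

Each `get_object` call is billed egress, which is why pulls should be batched and limited where possible.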

hamlinliu17 commented 4 years ago

@vc1492a Turns out the local disk is only 20 GB, but they do give me access to a scratch folder that has 2 TB of storage - files are only kept there for 14 days, though. The problem is that when running the models, they are saved onto my local disk. Is there a way to specify the path to which the models/history are saved? If not, I can just run the models from the scratch folder.

vc1492a commented 4 years ago

I think I know what you're asking. The lr.save() command saves the models / history to a location of your specification (it's just a path). Changing the path here should let you save the model history to your own directory. I'll add a to-do note on my end to make this path a parameter that is specified in the model specification cell in the notebook.
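One way to make that save location configurable is to build the path from an environment variable. This is just a sketch - the `SCRATCH_DIR` variable name and the subdirectory names are hypothetical, not something already in the notebook:

```python
import os
from pathlib import Path

# Hypothetical: read the save root from an environment variable,
# falling back to a local "models" directory when it isn't set.
save_root = Path(os.environ.get("SCRATCH_DIR", "models"))
model_path = save_root / "baseline" / "history"
print(model_path)
```

With FastAI, the equivalent would be pointing the learner's path at `save_root` before calling lr.save(), so each of us can direct output to our own scratch space without editing the notebook.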

That should give you the ability to save the model to the scratch folder - let me know if this is helpful or if I missed the point of your question!

hamlinliu17 commented 4 years ago

@vc1492a I set the learner path to the scratch folder and it seems to be working. Just to be sure, we are using the Learner from this webpage, right?

Edit: The training is taking longer than expected - the tqdm progress bar estimates ~5 hours to train. I was wondering if this is normal.

Edit 2: It seems to have stopped at the 26th epoch, and I was wondering if this was correct behavior? It gave me the message Epoch 26: early stopping after ~7 minutes of training. Never mind, I saw the message at the beginning about early stopping. Attached are the results from the model.
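For reference, the early-stopping behavior seen here works roughly like the plain-Python sketch below (the patience logic in general, not FastAI's actual callback code): stop once the validation loss has failed to improve for a set number of consecutive epochs.

```python
def early_stop_epoch(losses, patience=3):
    """Return the epoch at which training stops: the first epoch after the
    validation loss has failed to improve for `patience` epochs in a row."""
    best = float("inf")
    since_improved = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            since_improved = 0
        else:
            since_improved += 1
            if since_improved >= patience:
                return epoch  # stop here instead of running all epochs
    return len(losses) - 1  # never triggered; trained to the end

# Loss improves early, then plateaus -- training halts before the last epoch.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.6], patience=3))  # 5
```

Stopping at epoch 26 out of a longer schedule is exactly this mechanism firing, which also explains the much shorter wall-clock time than tqdm's initial estimate.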

Screen Shot 2020-08-19 at 12 18 16 AM
vc1492a commented 4 years ago

Hey @hamlinliu17! Glad you have it working now as well! That's great - I'll push an updated version of the notebook soon that will contain the save location as a parameter, so that we can both use the code as-is (perhaps we can load some settings in as environment variables).

A lot of my results look similar - it will be important for us to build out a mechanism for tracking model parameter settings, architectures, end results, etc. We can go ahead and start building out some of the metrics capability if we'd like to!

Since it looks like you can use the resources fine now, please feel free to close this issue if that's the case. Thanks!