Closed: Bryanrt-geophys closed this issue 3 years ago
Hello @smousavi05,
One more question in addition to the float type: I was looking through your code for the trainer function to try to determine whether the attributes from either the CSV or the HDF5 files are called by index or by name. I couldn't completely tell, but I suspect they are called by name. Is that correct? I was just looking over the output from my h5py code, and for some reason it mixes up the ordering of my attributes. I believe there is a way in h5py to dictate a specific order for each attribute; if EqT calls the attributes by index, I assume doing so would be very important. Thanks again for your help!
Bryan
@Bryanrt-geophys Hi Bryan, my code uses float32, but 64-bit might be okay too. Yes, it calls everything by name. Make sure that your trace names have the same format and end in ???_EV.
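For example, a minimal h5py sketch of writing a trace and its attributes by name (the trace and attribute names here are only illustrative; the real ones must follow the EqT/STEAD schema). Since everything is looked up by name, the order in which h5py stores the attributes does not matter:

```python
import numpy as np
import h5py

with h5py.File('NMTSO_test.h5', 'a') as f:
    grp = f.require_group('data')
    # The dataset name doubles as the trace_name referenced in the CSV and must end in _EV.
    dset = grp.create_dataset('XX.STA1_20210604120000_EV',
                              data=np.zeros((6000, 3), dtype=np.float32))
    dset.attrs['p_arrival_sample'] = 1200
    dset.attrs['s_arrival_sample'] = 2400
    dset.attrs['coda_end_sample'] = 4800
    dset.attrs['trace_category'] = 'earthquake_local'

# Retrieval is also by name, so attribute ordering is irrelevant:
with h5py.File('NMTSO_test.h5', 'r') as f:
    print(f['data']['XX.STA1_20210604120000_EV'].attrs['p_arrival_sample'])
```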
Perfect! It looks like I've set things up right then.
I am still testing my small data set with 350 events. Since I built the training data with h5py, my loss curves all have an inverted logarithmic shape leveling out near 0.03, but my F1 scores remain zero.
When trying to tune the trainer function, which parameters are okay to vary? I've mainly tinkered with the split between the training, validation, and test partitions, and I've varied the batch size and the number of epochs. Should I also be changing the weight values? (I'm taking a deep learning certification course right now, and if I understand correctly those weights represent the alpha in the logistic regression function. If so, I believe they need to be dialed in, but I'm not sure how to do this optimally; I'm still pretty early in the course.)
Thanks for all of your help along the way. It's been a great learning experience.
@Bryanrt-geophys No, those weights are different; they balance the multiple loss functions used in the model, so leave them unchanged. The batch size should be the most important parameter for you to change, given your small training set.
So your loss curves do change during training now, right? Could you attach your learning-curve plot?
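For context, those weights enter at compile time and simply rescale each output's loss before the losses are summed. A toy Keras sketch (standing in for EqT's detector/P/S outputs, not its actual architecture):

```python
from tensorflow.keras import layers, Model

# Toy three-output model, illustration only.
inp = layers.Input(shape=(6000, 3))
x = layers.Conv1D(8, 11, padding='same', activation='relu')(inp)
d = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='detector')(x)
p = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='picker_P')(x)
s = layers.Conv1D(1, 11, padding='same', activation='sigmoid', name='picker_S')(x)
model = Model(inp, [d, p, s])

# total_loss = 0.05 * L_detector + 0.40 * L_P + 0.55 * L_S
model.compile(optimizer='adam',
              loss=['binary_crossentropy'] * 3,
              loss_weights=[0.05, 0.40, 0.55])
```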
@smousavi05, thank you for the prompt response. Do you prefer that issues have your tag in them to alert you? If not, I will stop doing so. I appreciate the help and don't want to become a nuisance.
I have run a few variations of the call below, swapping the label function between box, triangle, and gaussian and changing the train/valid/test split between 60/20/20 and 85/05/10. It's my understanding, though, that I want larger validation and test partitions when I have a small training set, so I am sticking with 60/20/20 until I grow the training set to 100k+ samples.
trainer(input_hdf5='NMTSO_test.h5',
input_csv='NMTSO_meta.csv',
output_name='test_trainer_60_triangle',
input_dimention=(6000, 3),
cnn_blocks=2,
lstm_blocks=1,
padding='same',
activation='relu',
augmentation=False,
drop_rate=0.2,
label_type='triangle',
mode='generator',
train_valid_test_split=[0.60, 0.20, 0.20],
batch_size=20,
epochs=20,
gpuid=None,
gpu_limit=None,
use_multiprocessing=True)
I am still pretty early on in the deep learning course I am taking. Do cnn_blocks and lstm_blocks refer to how many CNN or LSTM hidden layers are used, or possibly to how many nodes are in those respective layers? What does the batch size refer to? Is a batch a segmenting of the events passed into the network? I didn't see a definition of either of these in the documentation within the trainer function. I believe the drop_rate variable has to do with the dropout regularization method, correct? I am reading about dropout now, but I don't think this is a terribly sensitive parameter to vary. Is that correct? It's also my understanding that ReLU is a largely standard activation function. Would you recommend experimenting with other activation functions like tanh, leaky ReLU, or sigmoid? My instinct is that this also does not need to be varied.
Okay, I just cracked into the subject of mini-batch gradient descent, so I now understand the basic idea of what the batch parameter refers to. It sounds like with my small training set it is optimal to use full-batch training rather than partitioning into mini-batches. Once I grow my data set, I will want to partition it into mini-batches to avoid non-converging stochastic gradient descent. From the basic intro I have viewed, it also sounds like batch sizes of 2^n are recommended for memory reasons. I am not sure whether the entire training data set needs to be a multiple of the mini-batch size.
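For concreteness, a quick back-of-the-envelope on how batch_size plays out with the 350-event, 60/20/20 example above; the training set does not need to be an exact multiple of the batch size, the generator just makes the last batch smaller:

```python
import math

n_train = int(350 * 0.60)      # ~210 training examples after the 60/20/20 split
batch_size = 20
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)         # 11 steps; the final batch holds the leftover 10 examples
```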
Stick to the batch size of 20 for now.
"Does the cnn_blocks and lstm_blocks refer to how many CNN or LSTM hidden layers are used or possibly how many nodes are in those respective layers?" No, each block contains multiple layers, usually 2; see the supplementary materials of the paper.
The drop_rate=0.2 in the input is the dropout ratio. You can increase it, since your dataset is small and may overfit more easily.
There is another drop value variable in the code, which is used for the Adam optimizer.
"Would you recommend experimenting with other activation functions like tanh, leaky ReLU, or sigmoid?" I have also tested other activation functions; it does not help.
You first need to make sure your data format is correct and the training works.
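As a rough illustration of what a "block" means (this is not EqT's exact architecture; the paper's supplement has the real one), a CNN block groups a couple of convolutional layers plus dropout, so cnn_blocks counts blocks rather than individual layers or nodes:

```python
from tensorflow.keras import layers

def cnn_block(x, filters, drop_rate=0.2):
    # Two stacked convolutions form one "block"; cnn_blocks counts these.
    x = layers.Conv1D(filters, 11, padding='same', activation='relu')(x)
    x = layers.Conv1D(filters, 11, padding='same', activation='relu')(x)
    # A higher drop_rate (e.g. 0.3-0.4) regularizes harder, useful for small datasets.
    x = layers.Dropout(drop_rate)(x)
    return layers.MaxPooling1D(pool_size=2, padding='same')(x)
```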
@smousavi05 Thank you for the further explanations and pointing me to the supplementary info from your paper.
I have one more question/issue I wanted to run past you before I dig into more reading and testing. I went back and ran the trainer() function on the provided 100samples.hdf5 file and got the same warnings from TensorFlow that I got when I ran my .h5 file. Could this be an issue with the way I have EqT and its dependencies installed?
trainer(input_hdf5='100samples.hdf5',
input_csv='100samples.csv',
output_name='test_trainer_fullFunct_normStd',
input_dimention=(6000, 3),
cnn_blocks=5,
lstm_blocks=2,
padding='same',
activation='relu',
drop_rate=0.2,
shuffle=True,
label_type='gaussian',
normalization_mode='std',
augmentation=False,
add_event_r=0.6,
shift_event_r=0.99,
add_noise_r=0.3,
drop_channel_r=0.5,
add_gap_r=0.2,
scale_amplitude_r=None,
pre_emphasis=False,
loss_weights=[0.05, 0.40, 0.55],
loss_types=['binary_crossentropy', 'binary_crossentropy', 'binary_crossentropy'],
train_valid_test_split=[0.60, 0.20, 0.20],
mode='generator',
batch_size=32,
epochs=20,
monitor='val_loss',
patience=12,
multi_gpu=False,
number_of_gpus=4,
gpuid=None,
gpu_limit=None,
use_multiprocessing=True)
Epoch 1/20
2021-06-04 15:07:22.138048: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-06-04 15:07:22.694861: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-06-04 15:07:23.566531: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] model_pruner failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'loss_3/detector_loss/binary_crossentropy/weighted_loss/concat' has self cycle fanin 'loss_3/detector_loss/binary_crossentropy/weighted_loss/concat'.
2021-06-04 15:07:24.287555: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: MutableGraphView::MutableGraphView error: node 'loss_3/detector_loss/binary_crossentropy/weighted_loss/concat' has self cycle fanin 'loss_3/detector_loss/binary_crossentropy/weighted_loss/concat'.
2021-06-04 15:07:24.439517: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] arithmetic_optimizer failed: Invalid argument: The graph couldn't be sorted in topological order.
2021-06-04 15:07:24.559227: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-06-04 15:07:25.082390: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-06-04 15:07:25.896886: W tensorflow/core/common_runtime/process_function_library_runtime.cc:675] Ignoring multi-device function optimization failure: Invalid argument: The graph couldn't be sorted in topological order.
Do you get the same warnings when training with the provided file?
If this is expected with the 100samples file, then I will go back to the drawing board on tuning and increasing my data size. Thanks for your time looking this over!
This is a new warning; it might be related to recent updates. Could you reinstall EqT from the GitHub source? That version supports TF 2.5.0.
@smousavi05, I created a fresh virtual env, pulled the GitHub version of EqT, and ran the python setup.py install command in the env, but I received the following error:
error: h5py 2.10.0 is installed but h5py~=3.1.0 is required by {'tensorflow'}
It appears TensorFlow 2.5.0 requires a higher version of h5py than the one pinned in setup.py. Is it okay if I change the setup file to h5py 3.1.0?
You can just update h5py. In your new virtual env, type: conda install h5py==3.1.0
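A quick way to confirm the new environment matches what the Git version expects (the version numbers are the ones mentioned above):

```python
import tensorflow as tf
import h5py

print(tf.__version__)    # expecting 2.5.0
print(h5py.__version__)  # expecting 3.1.0
```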
@smousavi05 Awesome, I have successfully installed the newest version of EqT and ran makeStationList, downloadMseeds, and preprocessor without issue.
However, when I now go to run
trainer(input_hdf5='100samples.hdf5',
input_csv='100samples.csv',
output_name='test_trainer_fullFunct_100samples',
input_dimention=(6000, 3),
cnn_blocks=5,
lstm_blocks=2,
padding='same',
activation='relu',
drop_rate=0.4,
shuffle=True,
label_type='gaussian',
normalization_mode='std',
augmentation=False,
add_event_r=0.6,
shift_event_r=0.99,
add_noise_r=0.3,
drop_channel_r=0.5,
add_gap_r=0.2,
scale_amplitude_r=None,
pre_emphasis=False,
loss_weights=[0.05, 0.40, 0.55],
loss_types=['binary_crossentropy', 'binary_crossentropy', 'binary_crossentropy'],
train_valid_test_split=[0.60, 0.20, 0.20],
mode='generator',
batch_size=32,
epochs=20,
monitor='val_loss',
patience=12,
multi_gpu=False,
number_of_gpus=4,
gpuid=None,
gpu_limit=None,
use_multiprocessing=True)
I am presented with these errors:
trainer() got an unexpected keyword argument 'multi_gpu'
trainer() got an unexpected keyword argument 'number_of_gpus'
Once I comment out those arguments in the call and rerun it, I receive this error:
`class_weight` is only supported for Models with a single output.
Have you run across similar issues?
No, that is new; I had not tested the trainer, though.
@smousavi05
I don't know all the nooks of Git. Can I pull a prior version of EqT to test whether any of them bypass the TensorFlow issues I've run into?
I think the class_weight error comes from Keras. I tried doing some reading on it, but I couldn't figure much of it out.
As always, I appreciate your help.
Not from Git, but you can install the previous stable version from pip. Yes, that is a Keras error; they have changed things and moved a lot of functions into TF.
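For what it's worth, the restriction is reproducible outside EqT: in recent TF/Keras versions, passing class_weight to fit on a multi-output model raises that same error. A toy example (purely illustrative, not EqT's code):

```python
import numpy as np
from tensorflow.keras import layers, Model

# Minimal two-output model, just to trigger the Keras restriction.
inp = layers.Input(shape=(4,))
out1 = layers.Dense(1, activation='sigmoid')(inp)
out2 = layers.Dense(1, activation='sigmoid')(inp)
model = Model(inp, [out1, out2])
model.compile(optimizer='adam', loss='binary_crossentropy')

x = np.random.rand(8, 4)
y = np.random.randint(0, 2, (8, 1))

try:
    model.fit(x, [y, y], class_weight={0: 1.0, 1: 2.0}, epochs=1, verbose=0)
except ValueError as err:
    print(err)   # `class_weight` is only supported for Models with a single output.
```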
And the versions downloadable from pip and conda are the same, correct?
No, the versions available on pip and Anaconda are older than the one on Git.
Okay, I just read that when you clone a repo, you pull the entire version history. git log will list all the commit history for the repo, and from there you can git checkout a specific commit to switch to that version of the project. I just did the pip install, though; I just thought I would pass on what I learned.
Also, I realized my HDF5 files only have a group for signal. I assumed a group for noise was unnecessary, as I didn't see one when I ran keys() on the opened HDF5 file; I only saw a data group with the ?_EV traces.
That would be okay.
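A quick way to double-check that layout (the file name is just the example from earlier in the thread): list the top-level keys and confirm the trace names under the data group carry the expected suffix:

```python
import h5py

with h5py.File('NMTSO_test.h5', 'r') as f:
    print(list(f.keys()))            # e.g. ['data']
    names = list(f['data'].keys())
    print(names[:5])
    print(all(name.endswith('_EV') for name in names))
```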
Hello @smousavi05, sorry I have left this issue open for so long. I have made good progress using EqT to build my own model. I have successfully expanded my training set to 10k+ examples and am seeing great improvements in the loss curves and the F scores. However, with the most recent iteration of my pipeline to keep increasing the data set size, it appears that some null values were introduced in a format that is not accepted by EqT. I think this is because some columns that previously held only char or int values now have NA values, and I am using an NA value with a different data type.
Since I wrote my pipeline in R and it wraps a Python function applying h5py to create the file, I am assuming the problem is how I have formatted the null values. I noticed that your 100samples.csv example contains both NaN and None values. Are these handled fundamentally differently from one another in Python, or are they interchangeable? In my pipeline, I have been explicitly using NA values.
Can I change them all to NaN, or do I need to ensure that columns with char types have NaN while columns with floats/ints have None?
I will be sure to close this issue after this correspondence.
Actually, I found the error, and it wasn't the null value formatting. In that iteration of my pipeline I had somehow produced duplicate rows in the CSV file that weren't in the h5 file. I removed them and this fixed the issue. I will close this now.
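For reference, a minimal sketch of the consistency check that catches this kind of mismatch (file names follow the examples above, and trace_name is assumed to be the CSV key column):

```python
import h5py
import pandas as pd

csv = pd.read_csv('NMTSO_meta.csv')

# Duplicate metadata rows that share a trace_name:
dupes = csv[csv.duplicated(subset='trace_name', keep=False)]
print(len(dupes), 'duplicated CSV rows')

# CSV entries with no matching dataset in the HDF5 'data' group:
with h5py.File('NMTSO_test.h5', 'r') as f:
    h5_names = set(f['data'].keys())
missing = set(csv['trace_name']) - h5_names
print(len(missing), 'CSV traces missing from the HDF5 file')
```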
I have been wrestling with the formatting of my h5 attributes, as I showed you in my last post. Because of this I switched over to using R's {reticulate} package, which lets me insert Python code directly into my R code, so I have been working to learn h5py and implement that. I have some minor questions, though, about dtypes.
R's standard float is a float64. Is that acceptable for EqT or do I need to look up how to convert float64 to float32?
My Attempted Python Code
A quick note: because the pipeline I have built to prepare my data up to this point is in R, the reticulate package lets me start a Python chunk between repl_python and exit. Within the chunk, I can call objects from R into Python using the syntax r.<r_object_name>.
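Since the open question above is about dtypes, here is a minimal sketch of the cast (names are placeholders; inside an actual repl_python chunk the array would come from r.<r_object_name> rather than numpy.random). R numerics arrive as float64, so they can simply be cast to float32 before writing:

```python
import numpy as np
import h5py

# Stand-in for an array pulled from R via reticulate; R numerics come over as float64.
waveform = np.random.randn(6000, 3)            # dtype is float64 by default
waveform32 = waveform.astype(np.float32)       # cast before writing

with h5py.File('NMTSO_test.h5', 'a') as f:     # file/trace names are only examples
    grp = f.require_group('data')
    grp.create_dataset('XX.STA2_20210604130000_EV', data=waveform32, dtype='float32')
```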