jimkon / Deep-Reinforcement-Learning-in-Large-Discrete-Action-Spaces

Implementation of the algorithm in Python 3, TensorFlow and OpenAI Gym
MIT License

why is this for continuous action spaces? #31

Closed M00NSH0T closed 6 years ago

M00NSH0T commented 6 years ago

Really cool that you've been working on implementing that algorithm in Python. I've been thinking of doing this as well. As far as I can tell, you're the only one that's tried doing this yet, so I'm not sure if you're looking for contributors, or if I should just go work on my own.

Either way, I'm just curious: why is this implementation only for continuous action spaces? Isn't the point of the algorithm to bring the benefits of DDPG to large discrete action spaces by mapping the continuous output to a set of discrete actions via the k-nearest neighbor approach? It looks like you've started working on that, but I'm still trying to figure out your code, so sorry if I'm misunderstanding.

Also, it looks like you're building your own nearest neighbor function here. Have you looked at using the one in sklearn?

jimkon commented 6 years ago

Hi.

I implemented this algorithm because I wanted to use it for my diploma thesis. I am currently working on a modification of this algorithm (in another repo) that has an adaptable continuous action space, with increased resolution where it is needed. That's why it works only for continuous spaces. But it is very easy (just 2-3 lines of code) to make it work for discrete action spaces too. I can even help you do that if you want.

The part I need for my project is pretty much done, so I have no plans to develop anything else on this implementation right now (except a deep clean-up, because I know the whole repo is a mess and has no results). I think the code is fully functional anyway (except for the modification for discrete action spaces).

I am not building my own nearest neighbor search. I use the FLANN library, which is the one suggested by the corresponding paper.

Bringing the benefits of DDPG to large discrete action spaces is not the only point of this algorithm. There are benefits for continuous spaces too, through the "action refinement". While developing this agent, I tried both DDPG and Wolpertinger on some domains with continuous action spaces, and Wolpertinger had better results almost every time. This is just an observation; I didn't research it further because I had to focus on my modification, but it was clearly happening.

Hope I answered all your questions. If not, feel free to ask.



M00NSH0T commented 6 years ago

Hey, thanks for the quick response. You definitely answered all my questions. Now that I've read through your code a bit more, I see what you're talking about with the FLANN library. That's cool how well it also works for continuous action spaces. Really, thank you for building this and putting it up on GitHub. I hope it gets you a job with a really awesome company, like Google. Maybe your name will be on one of these DeepMind papers one day!

I'm just working on a hobby project of sorts with my own custom openai environment, and I'm having some real problems with dimensionality. I started out with over 2500 discrete actions in a state space with something like 40k variables, and found it was just crawling along... each step was taking a full second or so even running on a 980Ti/8700k, and with that many variables and that action space size, I wasn't really getting anywhere with any of the algorithms I tried, even when I'd let it train for a few days (I suspect it would take months to find a viable policy).

I've been using Keras-RL for the most part, which I've modified to allow me to filter my actions each step to only allow valid actions. I also cut down the degrees of freedom of my system a bit, so I'm down to 1500ish actions, which with action filtering, comes down to 500 or so per time step. It's sped things up by a factor of 5, but I'm still not finding a usable policy in a reasonable amount of time. There are other things I want to try as well, like using shared layers in Keras to help learn quicker (my environment is mostly comprised of 515 identically sized vectors each describing similar objects concatenated together), but Keras-RL doesn't make that easy to do and it's no longer being maintained. That's what I've been working on for the past few days, but then I stumbled upon this paper yesterday and immediately got super psyched about this new approach. Then I found your repo and got even more psyched.

Anyway, I would love to try your code. Could I take you up on your offer to help with adding those couple lines of code to make it work with discrete action spaces? I would be super grateful.

thanks!

jimkon commented 6 years ago

Sorry for the late response. I didn't get a notification for your response and just saw it.

Despite the title of this repo, this implementation is not dedicated to truly large action spaces. Even though the algorithm supports them, my code was never optimized to be fast. Handling so many actions at each step is a big deal, and it needs careful optimization that wasn't a goal for this project, so I never did it. However, for the biggest action space I tried, with 10000 actions and 1000 nearest neighbors, the average step time never exceeded 50-60 milliseconds. Apart from some very unusual random spikes above 200 ms (maybe caused by things unrelated to the algorithm itself), the step time is pretty stable. I have to mention that I am not able to use GPU support for TensorFlow because of my old graphics card. So I think there is another problem with your setup.

From my experience, the algorithm should start adapting to the problem after 1000-2000 episodes, or even 5000 in extreme cases. The average reward seems to slowly converge to a high value after a very large number of episodes. Unfortunately my laptop is not able to run that many episodes in a reasonable time. When I want to check the actual efficiency of a new version of the agent, I run a sample of 10000 episodes (for the InvertedPendulum-v1 environment), which takes a whole day on my old, low-performance laptop. I think that is a decent sample to check the performance of the agent, as it manages to bring results that I think are good. By good, I mean with a decent average reward. But even after so many episodes, with so many steps and so much training at every step, there will always be plenty of episodes with low rewards. There are many reasons I think this happens, but the point is that it needs some time, depending on the environment, the reward system, the random initialization of the nets and some other things.

I am currently working on (and have almost finished) an update to how the agent's performance data is stored and presented. After that, I will update the README file with plots of some representative results, so it will be easier to check whether things run the way they should on your setup. I will also add the discrete action space option, since you need it.

Again, sorry for the late response.

jimkon commented 6 years ago

I forgot something. In the action_space module, I initialize FLANN with the algorithm parameter set to kdtree. As presented in the FLANN paper, this algorithm is better for low-dimensional data, but its performance decreases quickly for higher dimensions. I set it like this because I usually test the agent in low-dimensional environments and wanted to ensure it uses this nearest neighbor algorithm. If your action space has high dimensionality, you should change it according to the paper, or leave it unset and let the pyflann module decide.
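
For reference, a rough sketch of what that pyflann setup looks like (hypothetical data and sizes, not the repo's actual code):

    import numpy as np
    import pyflann

    # Hypothetical action space: 10000 points in a 6-dimensional continuous space.
    actions = np.random.uniform(-1.0, 1.0, (10000, 6))

    flann = pyflann.FLANN()
    # 'kdtree' works well for low-dimensional data; for higher dimensions the
    # FLANN paper discusses other algorithms, or the parameter can be left out.
    flann.build_index(actions, algorithm='kdtree')

    # Look up the k nearest stored actions to a proto-action.
    proto_action = np.random.uniform(-1.0, 1.0, (1, 6))
    indices, dists = flann.nn_index(proto_action, num_neighbors=10)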

M00NSH0T commented 6 years ago

Hey no worries. I really appreciate you taking the time to get back to me at all honestly, and it's awesome you're willing to get the discrete part working. I've been able to get my step times down to like 10-20% of what they were originally, so as soon as I can use this with discrete action spaces, I'm definitely going to try it out.

jimkon commented 6 years ago

By discrete action space, do you mean the discrete actions defined by the Discrete class in gym (a range of all the integers from 0 to n)? I realized there are many different options for a discrete action space, so could you make it clearer what exactly you mean?

M00NSH0T commented 6 years ago

yes exactly. I have 1547 discrete integer actions that each do a completely different thing (i.e. they're not just different degrees of the same action type). If need be, I could probably get this to work with like maybe 50ish actions instead, but I'd lose a huge amount of potential reward by doing that, so I'm whittling my way down as needed. That's why I'm really hoping this will work for me. I have it all churning away on an A3C implementation at the moment, but after a couple thousand episodes, it's still got a long way to go. I might need to let it run a few weeks with A3C...

Anyway, here are the essential lines from my environment:

from gym import spaces
...
self.action_space=spaces.Discrete(1547) 
jimkon commented 6 years ago

I added the functionality for the discrete action space. Check the action_space module, the Discrete_space class in the discrete-and-continuous branch. All you have to do is give the agent an environment with action_space=Discrete(n), where n is the total number of actions you want. After that you can use all the actions in the range [0, n).
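
Under the hood this should amount to a one-dimensional point cloud of the n integers, something like the sketch below (an illustration of the idea, not the exact code in the branch):

    import numpy as np

    n = 1547
    # Every discrete action becomes a 1-D point that FLANN can index, so the
    # proto-action produced by the actor is a single scalar in [0, n-1].
    points = np.arange(n, dtype=float).reshape(-1, 1)  # shape (n, 1)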

I haven't fully tested the code, so I won't merge the branch into master for now. It worked for all the discrete gym environments I tried, but I don't have the time to verify it thoroughly. If you have any problems using it, feel free to tell me.

M00NSH0T commented 6 years ago

Awesome! I'll check it out tomorrow.

M00NSH0T commented 6 years ago

Hey I'm trying to work with this, but I'm having some problems getting pyflann installed properly in Windows. I can install pyflann with pip, but I get tons of errors when I try running it, so I'm guessing I also need FLANN installed. However, the precompiled binaries for FLANN give me a hard error when I try installing and I'm getting errors building from source with the visual studio C++ compiler, so I'm going to have to try some other compilers out. I do also have Ubuntu on here, but some important pieces of my environment are currently only accessible in Windows (via a local MySQL database). Anyway, I'm going to have to figure out which will be less work - getting pyflann working in Windows, or getting MySQL running in Ubuntu. Probably the latter... Anyway, I'll let you know as soon as I have this sorted out.

caclemons commented 6 years ago

@M00NSH0T I've noticed your comments on Keras-RL and multi-discrete action spaces recently, as I'm working through similar challenges. Hate to hijack this thread, but since PMs aren't a thing here, could you shoot me an email at cclemons8@gatech.edu? I have some questions that I'd appreciate your opinion on. Thanks!

M00NSH0T commented 6 years ago

@caclemons sure no problem. I sent you an email just now. My highest hopes lie with this algo still, but I got sidetracked over the past week with some family stuff and haven't worked through my PyFlann issues yet. I'll be getting back to it over the next few days though.

M00NSH0T commented 6 years ago

I'm circling back to this finally... I just needed to convert the pyflann library from python2 to python3, as @jimkon kindly explained in their issues section. It was a very easy fix. Thankfully, it seems there's no need to install regular FLANN or anything like that, which would be a real pain in Windows.

M00NSH0T commented 6 years ago

Ok! I got it running with my custom environment. Seems to be handling my discrete action space without any errors. Too early to tell how well it's performing. The one thing I want to try to modify in here in the short term would be a way to filter the available actions to only consider the ones that are valid. Currently, I have it running just with very large penalties associated with invalid moves. With my other attempts, I was able to have my environment generate a numpy array of 1s and 0s that would state whether each action is valid or not. Ideally, I'd like to try to build that in here so when it does the k-nearest neighbor lookup, it doesn't just look for the one with the largest q-value, but rather applies this filter as well. @jimkon Any thoughts on how to do this easily? If not, don't sweat it... I'll figure it out eventually here.

I'm thinking I can do something in the wolp_action function. Like if I call env.get_valid_actions() I get a vector as long as my action space with ones and zeros. If I had this search say 100 nearest neighbors, at least a few of those should be valid moves. The one thing I'm trying to figure out here is what the vectors in here look like. Like is actions_evaluation a vector as long as the action space with q values in k slots and zeros everywhere else?

edit: ok I'm in debug mode picking my way through. Probably need to do something in action_space.search_point instead.

Also, the files this thing is saving for each timestep are like 300mb. Do I need to save that much, or should I just be saving the past N files instead? I'll fill up even one of my 2TB drives in no time with this as is.

jimkon commented 6 years ago

If you know at each timestep which actions are not valid, you can throw them away (and even add new ones). The purpose of the whole action_space module is to make the action space dynamic. It is not obvious in this repo, because I moved the whole adaptation of the action space to another repo. To do this, you first need to delete your FLANN index ("self._flann.delete_index()"), assign the desired action space to self.__space, and finally create another index (for example with self.rebuild_flann()). It is not a costly process to do at each episode, but if you want to do it at each step, maybe it's not the solution you're looking for.
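
A rough sketch of that per-episode update, written as a method added inside the Space class (the attribute and method names come from the description above; update_actions itself is a hypothetical helper):

    def update_actions(self, new_actions):
        # new_actions: NxM numpy array holding only the currently valid action points
        self._flann.delete_index()   # drop the old search tree
        self.__space = new_actions   # replace the stored action set
        self.rebuild_flann()         # build a fresh FLANN index over the new set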

As for the stored files, I don't know why they are so big in your project. The files that appear in the temp directory are temporarily saved files to free some RAM. When the simulation stops, they are merged into one large data file. The data file I have uploaded is 79 MB zipped in total and contains information for 10000 episodes; it is one of the biggest files I have created from a simulation so far. By the way, you can choose what to save, or whether to save data at all. If you don't want to save anything, you'd better disable the whole system.

M00NSH0T commented 6 years ago

Ok, got it working. The results I'm getting with this so far blow away everything else I've tried. It was really important to build in that valid-move piece though. In the end, what I did was call my get_valid_actions() function from your main method, then pass that vector to the Wolpertinger agent.act() method along with the state. Note: I modified my get_valid_actions() function to return a vector that lists valid moves by integer reference rather than as an array of ones and zeros. This made it fairly trivial to act as a filter on your actions vector. Here's my modified wolp_action function (valid_actions is passed through agent.act() from main):

    def wolp_action(self, state, proto_action, valid_actions):
        # get the proto_action's k nearest neighbors
        actions = self.action_space.search_point(proto_action, self.k_nearest_neighbors)[0]
        # keep only the neighbors that are valid moves this timestep
        mask = np.isin(actions, valid_actions)
        actions = actions[mask].reshape(-1, 1)
        # fall back to the "do nothing" action if no neighbor is valid
        if len(actions) == 0:
            actions = np.array([1545]).reshape(-1, 1)
        self.data_fetch.set_ndn_action(actions[0].tolist())
        # make all the state, action pairs for the critic
        states = np.tile(state, [len(actions), 1])
        # evaluate each pair through the critic
        actions_evaluation = self.critic_net.evaluate_critic(states, actions)
        # find the index of the pair with the maximum value
        max_index = np.argmax(actions_evaluation)
        # return the best action
        return actions[max_index]

Note: 1545 is my "do nothing" action which I always want my environment to default to when it's unsure. I have to include this because if my knn factor is less than 1.0, it's possible that no valid actions will be within the k nearest neighbors. It doesn't seem to happen very often, but without that there, I'd get an invalid "0" action every episode or two.

There are still some questions I have regarding the data and data_process functions... the amount of data getting saved right now is nuts. I need to figure out how to trim down the files being saved here, but I'm apprehensive about doing that without understanding what all the code's doing. I'm going to work through that next.

Thank you so much for building this repo. This is really nice work.

M00NSH0T commented 6 years ago

@jimkon hey sorry I didn't see your last response before I responded... I was distracted by the awesome results I was seeing. What is this saved data ultimately used for here? Is it something I need to save for it to work properly (like network weights and the points in the point cloud), or is it just for analysis / plotting later on? And how would I use this after training? If the network weights don't get stored here, where do they get stored?

Last question: how would I properly disable it? Just comment out data.finish_and_store_episode()? I suspect part of the problem with these files being so large is that I have 40k features, which is already trimmed down significantly. I probably also need to beef up the neural network behind this. I'll dig into the code more today either way to try to figure it all out.

jimkon commented 6 years ago

I think that if you had a bigger action space, your filtering would be a very considerable cost to pay at every step. The purpose of the Wolpertinger architecture is to reduce the per-step time complexity to O(log N), and even though np.isin and .reshape are very fast, they are still O(N). Maybe rebuilding your action space and FLANN index after each episode would achieve the same result with a significant drop in time complexity. But you know best what works for you.

All the "data" you see, are for analysis and for better monitoring the functionality of the agent and the progress of the simulation. They store information at every step about the actions the agent picked, the rewards, observations and some other. You can just comment them out if you don't need them, and you'll be okay.

M00NSH0T commented 6 years ago

Understood, though the problem is that valid moves are determined at each step. Like, move 145 might be valid at timestep 4 but not timestep 5, and so on. Definitely not ideal, but I figured it's better to at least filter out invalid moves before the critic evaluates them. It basically shrinks k dynamically each timestep, but of course I have to use a larger k initially than I ideally would, to ensure there's something for it to pick from each turn. I'll look into it some more. Obviously my environment has to do some work to generate this vector, though relative to some of the other stuff it's calculating, the cost is pretty low.

Regarding the data - do the neural network weights get saved somewhere? I've been trying to comb through the DQN code to figure it out, but I'm not seeing it. I usually use Keras because it makes all that really easy (I haven't really worked directly with TF much), and I have a Keras-RL implementation of DQN that I might try to integrate instead of this one if the weights aren't getting saved. Like, if I train this for 4-5 days, how do I use it afterwards without retraining? And I assume the point cloud would have to be re-loaded as well. Is there a way to save just the point cloud and reload it later?

jimkon commented 6 years ago

No, I don't save the weights, because I am not trying to accumulate all the experience. My purpose is to help the agent adapt and bring better results. So each time I run the simulation, the agent initializes its neural networks randomly. I think the actor and critic parts need a big rework, not only in the way they are initialized but in the way they are trained too. But the surprising fact is that (in my case), even with so many sub-optimal parts and so much randomness, all the simulations follow roughly the same pattern and bring similar results.
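
If you want to keep the trained networks yourself, a minimal TF1-style checkpoint would look roughly like the sketch below (this is not part of the repo; the variable and path are just placeholders):

    import tensorflow as tf

    # Minimal sketch: a variable standing in for the actor/critic weights,
    # saved after training and restored before running live.
    w = tf.Variable(tf.random_normal([4, 2]), name='actor_w')
    saver = tf.train.Saver()

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, './checkpoints/wolpertinger.ckpt')     # after training
        saver.restore(sess, './checkpoints/wolpertinger.ckpt')  # instead of random init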

M00NSH0T commented 6 years ago

yeah man I'm seeing some really awesome results after just a few hundred episodes. I modified things a bit so I save the results each turn to a MySQL database, which another program queries every 5 minutes to generate a plot from, along with 50 episode moving average, and my moving average is beating anything else I've tried by a wide margin. Regular DQN and A3C were my next best, but even after 10k episodes, neither of them found an acceptable solution to my problem, whereas this seems to have zeroed in on a solid solution already and is continuing to improve. I'm going to let it run another 1000 episodes at least before I totally believe my eyes, but so far so good.

Anyway, I'm definitely going to try using a different network setup, built with Keras though. Once I finish my immediate problem, I'll see if I can generalize it so it can work with an Atari game or whatever and put it on Github, provided you'd be ok with me including some of your code. I'd obviously credit you for the pieces I use. Or I could just be a contributor on yours, though I think the changes I need to make are going to be pretty pervasive. Mostly I want to use your Wolpertinger Agent and the associated code with the point cloud, and tie that in with either a Keras-RL backend or build my own Keras based network that would let me save the weights and point cloud every X episodes. For my application, I need to be able to pretrain it, then have it work right away when I use it live. I can't have it relearn from scratch each time, even if it does do that fairly quickly. The advantage to me of this learning quicker is that I should be able to train my network in days instead of weeks or months. But I need to be able to update that training with new data, and maximize the likelihood that it will achieve a high reward as I run it just for one live episode after each cycle of training.

I'm still really trying to get my head wrapped around how this point cloud is really working. I just need a few days to really dig into this. Maybe that part doesn't need to be saved for this to work the way the neural networks do? Maybe I just need to save the actor / critic weights?

Thanks again for building this and being so responsive to all these questions by the way.

M00NSH0T commented 6 years ago

Ok, I'm struggling to understand how this conversion from the proto-action to the discrete actions is happening. I see both in the paper and in your code that the nearest neighbor lookup happens using only the proto-action as input (i.e. there's no consideration of the state). But how does this work? It looks like there are several axes that get broken into uniform distributions of points, each point representing one of the discrete actions, but I think something might be broken in the discrete space case here. The number of dimensions of the action_space just gets set to 1, so it's all happening on a single axis, whereas in the paper they talk about running the experiment with a varying number of axes (see figure 9)... they do it with both a 20 and a 200 dimensional representation.

Anyway, even with this apparently just being a 1d representation of the space, I'm having a hard time understanding how it works. The proto-action is a double... ok, so the nearest neighbor of a double on a single axis should be really easy to calculate. But wouldn't that just mean that in this case FLANN isn't really doing much more than a linear conversion of some kind? If it were a 2d representation, how would both coordinates for the point be calculated?

I'm thinking maybe the output of the actor (i.e. the proto-action) is supposed to be a multi-dimensional array instead of a single double, like I'm getting now. This code would work if my proto-action had numerous dimensions, but for whatever reason, the current actor just gives me a single number. I think I just need to code into the config that the actor network should spit out multiple outputs (one for each dimension). Or am I missing the point, no pun intended, here? :)

Sorry for all the questions, this is just a bit confusing.

M00NSH0T commented 6 years ago

ok so I'm fairly certain the problem I'm having here is that the proto-action is supposed to be a multi-dimensional point, whereas what I'm seeing when I debug here is a single double. This looks like it's been hardcoded into the discrete space.

class Discrete_space(Space):
    """
        Discrete action space with n actions (the integers in the range [0, n))
        0, 1, 2, ..., n-2, n-1
    """

    def __init__(self, n):  # n: the number of the discrete actions
        super().__init__([0], [n - 1], n)

The last line initializes the space object with a one-element list for low, which in turn drives this to only have a 1-dimensional representation. I'm pretty sure that for this to work correctly, the proto-action should have a variable number of dimensions that the user can set manually. In section 8.2 of the paper, DeepMind took a recommender task with 49 available actions (i.e. things to recommend) and represented it in both 20 and 200 dimensional space to see the performance change. Interestingly enough, the 20-dimensional representation did better. Here, however, it looks like we're taking that to the far, far extreme of a 1-dimensional representation, and I think that's actually going to be too general to work. That might explain why my implementation does pretty well (though not "great") right off the bat, but doesn't really improve, even when I pump up the nodes in the hidden layers.

I haven't figured out yet whether the DDPG actor agent takes any of this as input, or if it's been hardcoded to only spit out a single output too, but I'll be checking that out next.

jimkon commented 6 years ago

  1. I didn't dig into FLANN very much, but I think it works with search trees. You insert the action space and you get an "index", which is kind of the root of the tree. Then you search using this index among the points you initially gave as the action space. It's important to note that the search is approximate and won't always return the actual k nearest neighbors, unless you set the precision parameter to 1.
  2. The code works the same for any number of dimensions. I have tried it with 1 and 2, and I am not sure if there is any bug that stops it from working with more. All the components support a custom number of dimensions. This number depends on the env.observation_space and env.action_space that each gym env defines.
  3. There is only one point where the code is restricted to single-dimensional spaces: the Discrete_space I made for you, because you said you wanted a range of integers as your action space. In the init function of the Space class, you insert the low and high n-dimensional vectors to determine the lowest and highest values of each axis.
  4. Inside Space, I normalize all the ranges to prevent the search from looking only along the most dense dimension (see the sketch below). Think of having a 2d space with low=[-1000, -1] and high=[1000, 1], and 10 actions on each axis (10^2 in total). On the first axis the actions would be 200 units apart, and on the second 0.2 apart. The nearest neighbor search would never look across the first axis.
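
A sketch of what point 4 means, using the hypothetical 2d ranges from the example (not the repo's exact code):

    import numpy as np

    low = np.array([-1000.0, -1.0])
    high = np.array([1000.0, 1.0])

    def normalize(points):
        # Rescale every axis to [0, 1] so no single axis dominates the
        # Euclidean distances used by the nearest-neighbor search.
        return (points - low) / (high - low)

    def denormalize(points):
        return points * (high - low) + low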

If I didn't answer a question, you can ask again.

M00NSH0T commented 6 years ago

thanks. yeah, that's all in line with what I've been able to piece together. The main problem here is that while using just one or two axes is probably fine for the continuous use case (think of just breaking a steering wheel down into 360 one-degree parts on one axis), with recommender systems (which is more like what I'm doing), you'd need a lot more axes for it to really work correctly. Like, I guess if you take a 2d plane and place points on there, each representing an action, it could learn to hone in on the ideal action that way, but the neighbors around it may be completely unrelated and may even be really bad picks. With the steering wheel analogy, if you pick 181, 182, or 183 degrees, they'll all perform fairly well (i.e. on a 1d line, the nearest neighbors will perform similarly) and likely better than 23, 24, and 25, but with recommender-type systems, 181, 182 and 183 may be completely different movie genres or whatever, and the neighbors you may want to draw from may be 2, 69, 723, etc. So you'd need those N-dimensional boxes to allow the point to settle near a border that lies between similar action types, which means N needs to be really high... on the same order of magnitude as the size of the action space. I don't think the size of the observation space should really factor in here (only into the actor/critic neural network setups), and with the action space just being a range of integers from low to high right now, it's just not going to be compatible.

Anyway, I'm understanding this a lot more now, and I'm pretty sure I know where to go with it now, so I'll spend today trying to make the FLANN piece work in higher dimensional space. Maybe it's just a matter of switching my action space over to a one-hot style vector... I'll figure it out.

btw - you're right about the trees, but regardless of how it's working, what it's doing is ultimately just looking for the nearest N-dimensional neighbor. I'm not really looking to get into those details either, but if you or anyone in the future is interested in reading up on it, the Wolpertinger article cited this paper which explains the concepts fairly well, and gives some nice visuals. Helped me figure out what my problem was at least: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6809191

Thanks again.

jimkon commented 6 years ago

FLANN, the actor and the critic all work for any number of dimensions. In FLANN you just have to insert an NxM numpy array, where N is the number of points and M is the number of dimensions. The actor and critic take the number of dimensions as input at initialization. The whole code is written to work with any number of dimensions, so I don't really understand what's going wrong. If you need a discrete action space (as defined in OpenAI gym) you can use the Discrete_space I made; if you need n-dimensional boxes, then you just have to use the continuous action space and set the point coordinates to integers. Otherwise you should change Discrete_space in a way that works for you.
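
As a rough illustration of the "n-dimensional box with integer coordinates" idea (hypothetical sizes, not code from the repo):

    import numpy as np

    # A 3-dimensional box of integer actions, 10 values per axis -> 1000 points.
    axes = [np.arange(10, dtype=float)] * 3
    points = np.stack(np.meshgrid(*axes, indexing='ij'), axis=-1).reshape(-1, 3)
    # `points` is the NxM array (N=1000 points, M=3 dimensions) that FLANN would index.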

M00NSH0T commented 6 years ago

I've been tracing my way through, trying to debug. I think I'm almost there... I've been making modifications to the agent and space code so the user can just set the number of dims once, manually. The problem is ultimately that a discrete space in OpenAI gym is really just an integer... there's no Discrete_space.low vector or anything like that. So the code is just making a one-dimensional FLANN implementation with 1547 points (my number of actions) on that one axis, which just makes no sense. It does a good job of picking the optimal action, but the nearest neighbors to that on one axis are just random (or the ones that are numerically labeled close by, which can be random). Even on two axes, chances are most of the top actions wouldn't be anywhere near each other on that 2d plane. Really though, I think the dimensions (i.e. axes) used in FLANN should be a config setting in the agent or whatever... not something that's dependent on action_space.n. Maybe default it to one or two for the continuous case, and maybe default to the number of actions for the discrete case, but either way I think the user should be able to override this a bit more easily... experiment around with it to find the ideal input, like they did in the paper, because it looks like that has a huge impact on the overall performance of the algorithm. I'm currently trying to set that up... it's just taking some time to dig into where things are going wrong, since most of my errors now are getting flagged within the FLANN code, and I'm having to step my way through that to figure out where some of these variables are coming from.

jimkon commented 6 years ago

The number of dimensions depends on the problem you have to solve. And both the DDPG and Wolpertinger agents in this repo are set up to work with OpenAI gym. Gym's environments define the number of dimensions, first through the env.action_space in the src/ddpg/agent.py Agent class, and second through env.step, which restricts the input actions. You have to change these two to make all the components work correctly.

M00NSH0T commented 6 years ago

Yeah, totally. That's why I think it should be a variable the user can control, as opposed to being hard-coded as the length of these low and high vectors, which the Discrete action_space in OpenAI doesn't support anyway.

env.action_space seems like a bit of a mess in OpenAI. A lot of people I've seen commenting on other repos on GitHub are having problems with some of its subclasses (here's one of many discussions on it: https://github.com/openai/universe-starter-agent/issues/75). For instance, no libraries out there seem to support multi-discrete or tuple action space types, because everyone assumes there's an action_space.n variable, which those don't natively provide. This leads a lot of people to enumerate all the permutations a multi-discrete action space would have and then just stick those in a giant discrete action space. It's really clunky and kind of a nightmare to work with, but that's what we seem to be stuck doing at the moment. The same people facing this problem are also the ones who would benefit from this Wolpertinger algo the most. But discrete action spaces don't have the high and low variables that the Box space does, so the agent code isn't handling them properly.

I'm about to give up on using this DDPG agent implementation. It requires something to be in these high and low vectors for its TensorFlow grad_inverter call to work (and I'm not sure why it's calling that, so I'm nervous about messing with it just yet). I've tried defaulting to np.zeros() and np.ones() of the size of the action space for the low and high, but that's not working correctly.

I think I just need to take a step back here and try to do what I was originally planning: migrate this over to use a Keras-based DDPG or some other actor/critic algo. I'm not familiar enough with the raw TensorFlow code to mess with this DDPG implementation much directly, and I don't have enough time at the moment to learn it... maybe down the road.

My step function works correctly with a discrete integer input... at least it does with Keras-RL and the other OpenAI-compatible libraries I've used.

jimkon commented 6 years ago

I think you have to understand better how the action_space module works. It's very simple and very basic, and you can extend it to do whatever you want very easily. For example, the low and high vectors that you need so much are already implemented in the continuous space, and there are dozens of ways to make them work for a discrete multidimensional space too. If you don't understand this part, I think you will have a big problem doing anything, even if you solve the problem you have now.

M00NSH0T commented 6 years ago

hey, I think I'm not doing a good job of explaining the problem I'm having. I understand that module fine, and I appreciate the help you've given me. I'm just taking a step back and will write my own code that works directly with pyflann and another Keras-based actor/critic implementation. I understand pyflann pretty well now and am almost done getting it to work for the large discrete case in 100+ dimensional space. It would have taken me a lot longer to get here without seeing what you did, so thank you again.

M00NSH0T commented 6 years ago

So, by stepping back a bit and working fresh with pyflann, I was able to figure out my core issue. Basically, I needed to build my own point cloud to seed pyflann with. The uniform space implementation just blows up as you increase the dimensionality (I'd run out of my 32GB of RAM during initialization) and generally doesn't make sense for non-continuous action spaces, so it requires that each user logically think through the most sensible way to set up their problem's point cloud, in whatever dimensional space makes the most sense, so that adjacent points actually should be adjacent. I just did this in Excel with some formulae. I had the origin be "no action"; then I have 515 actions that pair up with counter-actions, so those became the ones and negative ones on each of 515 axes. Then I had 516 more actions that don't have counter-actions, so those each got their own axis (i.e. each became the single point on that axis, with do-nothing at zero). This left me with 1031 axes for my 1547 actions. Once I got that all mapped out, I was able to rework some of the other code to accommodate this setup. I had to modify my environment to accept either a single integer action or a 1031-element vector for its action input... and made a number of other clunky changes along the way, but it does run. I'm still going to try to port this over to another DDPG implementation so I can save after training, but I just wanted to let you know I was able to make it run.
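
In case it helps anyone later, here's a rough sketch of the kind of point cloud I described (the counts match my case: 515 action/counter-action pairs, 516 unpaired actions, and a "do nothing" origin; the index ordering is just illustrative):

    import numpy as np

    n_pairs, n_single = 515, 516
    n_dims = n_pairs + n_single                # 1031 axes
    n_actions = 1 + 2 * n_pairs + n_single     # 1547 actions, index 0 = "do nothing" at the origin

    cloud = np.zeros((n_actions, n_dims))
    for i in range(n_pairs):
        cloud[1 + 2 * i, i] = 1.0              # action i
        cloud[2 + 2 * i, i] = -1.0             # its counter-action
    for j in range(n_single):
        cloud[1 + 2 * n_pairs + j, n_pairs + j] = 1.0  # unpaired actions, one axis each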

thanks again for building this and I'm going to close this thread.

hedongyan commented 4 years ago

Hello, I'm new to "large action spaces", and I'm trying to do some work on large discrete action spaces. Will this work, or could it be applied, for 2^100 actions, such as 100 switches that can each be open or closed? And what were f and g in the authors' paper? @M00NSH0T