jcrozum / pystablemotifs

Python library for attractor identification and control in Boolean networks
MIT License
28 stars 7 forks source link

Loading genome scale model #72

Closed yzongy closed 2 years ago

yzongy commented 2 years ago

Hi! It's great to see a tool developed for large scale boolean networks analysis.

I was trying to use pystablemotifs to analyse a GRN boolean model with about 2500 nodes. But I found it took a very long time to just load the booleannetwork.txt file. Just wondering if there is any hint to accelerate this process?

Best,

Bill

jcrozum commented 2 years ago

Hi Bill! I assume you're talking about running the command pystablemotifs.format.import_primes('/.booleannetwork.txt')? This can indeed be quite slow when the update functions are very large, i.e., when there are genes that are regulated by a large number of other genes. Unfortunately, there is no easy way to solve this issue in general. One "trick" that might help is to introduce intermediate variables for genes that have large update functions. For example, on a smaller scale,

X_A*= X_B & X_C & X_D & X_E & X_F | X_G & X_H & X_K

could be replaced by

X_A*= X_Y  |  X_Z
X_Y*= X_B & X_C & X_D & X_E & X_F 
X_Z*= X_G & X_H & X_K

As long as X_Y and X_Z don't share any variables, this is unlikely to have any effect on the attractor repertoire and should give a fairly sizable improvement on how long it takes to import the rule.

The downside of using this trick is that one of the steps in the attractor finding algorithm can sometimes attempt to undo the trick, which is very slow. If it is taking too long to build the attractor repertoire, I suggest using the MPBN_update=True option in the AttractorRepertoire.from_primes function. This may miss certain very fragile oscillations called motif-avoidant attractors, but oscillations of this type are extremely rare in asynchronously updated BNs, and are nonexistent in the MPBN update scheme described by Paulevé et al. in 2020 (hence the name of the option).

yzongy commented 2 years ago

Thanks, Jordan! Yes, I am running pystablemotifs.format.import_primes, which has been running for 8 hours already on my Mac with 2.4 GHz Dual-Core Intel Core i5. So I wonder if there is something wrong with the format of my input file? Regulators not being regulated are also included in my input file:

X_A*= X_Y  |  X_Z
X_Y*=
X_Z*=

I am not sure if something else is causing this slow loading issue, so I have attached an example file here. If you don't mind, could you please give it a try to see if you can load it?

GRN.txt

I appreciate it very much!

jcrozum commented 2 years ago

I've taken a quick look, and I am fairly certain that the problem is that some of the rules have too many variables in them for the code to process in a reasonable amount of time. Once you get to a dozen or so input variables for a rule, the importing process slows down dramatically. I am able to load a subset of the file just fine, so the formatting is not the issue. I am not able to load some of the longer rules unless I break them up using the trick I outlined above. For example, I am not able to load this:

YGL008C_fc*=  (YHR084W_fc) or (YIL131C_fc) or (YER040W_fc) or (YMR043W_fc) or (YGL073W_fc) or (YHR206W_fc) or (YHR178W_fc) or (YMR037C_fc) or (YNL216W_fc) or (YJR060W_fc) and not (YNL027W_fc or YDL056W_fc or YDL020C_fc or YDR146C_fc or YEL009C_fc or YKL185W_fc)

But I am able to load this in less than a second:

YGL008C_fc*=  (YHR084W_fc) or (YIL131C_fc) or (YER040W_fc) or (YMR043W_fc) or X1 or X2
X1*=(YGL073W_fc) or (YHR206W_fc) or (YHR178W_fc) or (YMR037C_fc) or (YNL216W_fc)
X2*= (YJR060W_fc) and not (YNL027W_fc or YDL056W_fc or YDL020C_fc or YDR146C_fc or YEL009C_fc or YKL185W_fc)

In this case, if your use-case allows it, I recommend breaking up the rules like this if they have more than 10 or so inputs. The reason this takes so long is that when you import the rules, the code computes a canonical form for the Boolean rules that makes subsequent operations much more efficient. Computation of this canonical form is fast for small rules, but scales poorly. As such, a larger number of smaller rules is more efficient than a smaller number of larger rules.

As an aside, I do want to caution you that

X_A*= X_Y  |  X_Z

is fine by itself, but adding the lines

X_Y*=
X_Z*=

will cause a crash (I did not see any such lines included in the example you sent me though, so it should be OK). If you leave the update function undefined for a variable, it will be treated as source node, i.e.,

X_A*= X_Y  |  X_Z

by itself is equivalent to

X_A*= X_Y  |  X_Z
X_Y*=X_Y
X_Z*=X_Z

That being said, if you have a lot of these, then you might want to specify values for the unregulated genes explicitly, e.g.,X_Z*=1, so that the code won't try to find attractors for all possible source input combinations.

yzongy commented 2 years ago

Thanks, Jordan!

I have written a parser to split lines as you suggested and also specified unregulated genes. But when an error raised when I was importing the file with pystablemotifs.format.import_primes:

ERROR failed to run bnet2primes: cmd=/Users/yzongy/.pyenv/versions/3.9.7/lib/python3.9/site-packages/pyboolnet/binaries/BNetToPrime/BNetToPrime_mac64, return_code=1, out=b'{"Q0050_fc":[[{"YIR018W_fc":0,\x08 \x08},\x08 \x08],[{"YIR018W_fc":1,\x08 \x08},\x08 \x08]],"YAL003W_fc":[[{"YDL056W_fc":0,"YER111C_fc":1,"YHR206W_fc":0,"YKL112W_fc":0,"YNL216W_fc":0,\x08 \x08},{"YDL056W_fc":0,"YEL009C_fc":1,"YHR206W_fc":0,"YKL112W_fc":0,"YNL216W_fc":0,\x08 \x08},{"YAL003W_fc_intermediate_1":0,"YDL056W_fc":0,"YHR206W_fc":0,"YKL112W_fc":0,"YNL216W_fc":0,\x08 \x08},\x08 \x08],[{"YAL003W_fc_intermediate_1":1,"YEL009C_fc":0,"YER111C_fc":0,\x08 \x08},{"YNL216W_fc":1,\x08 \x08},{"YKL112W_fc":1,\x08 \x08},{"YHR206W_fc":1,\x08 \x08},{"YDL056W_fc":1,\x08 \x08},\x08 \x08]

Was it caused by some conflicts or the wrong formatting of the input file? processed_output_boolNet.txt

jcrozum commented 2 years ago

It looks like there's a problem with the formatting: the intermediate node rules have mismatched parentheses, e.g.,

YHR087W_fc_intermediate_1*= YIL101C_fc) or (YMR172W_fc) or (YKL062W_fc) or (YIL131C_fc)

should be

YHR087W_fc_intermediate_1*= (YIL101C_fc) or (YMR172W_fc) or (YKL062W_fc) or (YIL131C_fc)

I saw this in all the intermediate rules I checked in the file you attached.

yzongy commented 2 years ago

Hey Jordan, it works!

I have fixed the missing parentheses and also break long "and not" lines.

For the GRN network with about 3500 genes, it took about 35mins to import the file and calculate attractors.

Thanks again for your help!

jcrozum commented 2 years ago

Great, I'm glad it worked for you!