lhatsk / AlphaLink

AlphaLink: Integrating crosslinking MS data into OpenFold
Apache License 2.0

inference without MSA #25

Open sehooni opened 2 months ago

sehooni commented 2 months ago

Hello, thank you for the great research.

In the AlphaLink paper, I think you test without an MSA in Fig. 2 (e, f).

Then how can we build a dataset without an MSA?

In my work, I have a test set with MSAs, but I want to run it without the MSA features. At inference time I tried neff 0, but it didn't work.

# subsample MSAs to specified Neff
msa = feature_dict['msa']

if args.neff:
    logger.info(
        f"Subsampling MSA to Neff={args.neff}..."
    )
    indices = subsample_msa_sequentially(msa, neff=args.neff)
    feature_dict['msa'] = msa[indices]
    feature_dict['deletion_matrix_int'] = feature_dict['deletion_matrix_int'][indices]

Can I get some advice?

Thank you for reading.

lhatsk commented 2 months ago

In this case, you take only the first entry, since that is the target sequence itself.

msa = feature_dict['msa']

if args.neff:
    logger.info(
        f"Subsampling MSA to Neff={args.neff}..."
    )
    #indices = subsample_msa_sequentially(msa, neff=args.neff)
    feature_dict['msa'] = msa[:1]
    feature_dict['deletion_matrix_int'] = feature_dict['deletion_matrix_int'][:1]
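
The same idea as a standalone helper, as a minimal sketch: it assumes `feature_dict['msa']` and `feature_dict['deletion_matrix_int']` are NumPy arrays with one row per aligned sequence and the target in row 0. The function name `keep_target_only` and the toy arrays are hypothetical and not part of AlphaLink. Note also that `if args.neff:` is false when `args.neff` is the number 0, so passing 0 would skip the block and leave the full MSA in place, which may be why the neff 0 attempt had no effect.

```python
import numpy as np


def keep_target_only(feature_dict):
    """Return a shallow copy whose MSA features hold only the target row.

    Hypothetical helper for illustration; assumes row 0 is the target sequence.
    """
    out = dict(feature_dict)
    out["msa"] = feature_dict["msa"][:1]
    out["deletion_matrix_int"] = feature_dict["deletion_matrix_int"][:1]
    return out


# Toy features: 5 aligned sequences of length 8 (placeholder values).
features = {
    "msa": np.arange(40).reshape(5, 8),
    "deletion_matrix_int": np.zeros((5, 8), dtype=np.int32),
}

single = keep_target_only(features)
print(single["msa"].shape)                  # (1, 8)
print(single["deletion_matrix_int"].shape)  # (1, 8)
```

Slicing with `[:1]` rather than `[0]` keeps the leading sequence dimension, so downstream code still sees a 2-D array.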

sehooni commented 2 months ago

Oh, thank you for the reply.

By the way, does that mean neff=1 is the same as no MSA?

One more question: the features contain the key 'num_alignments', which is the number of sequences in the MSA. Does it need to change with neff?

lhatsk commented 2 months ago

> By the way, then it means that neff=1 is same with no MSA?

Yes, it will amount to the same thing in the end.

> And one more thing to ask. In features, there are the key 'num_alignments', which means the number of msa. does it need to change for neff?

You don't need to change anything else.

sehooni commented 2 months ago

Thank you for responding to the issue. By the way, I have another question!

The README says: “Where restraints.csv is a comma-separated file containing residueFrom,residueTo,meanDistance,standard deviation, distribution type (normal/log-normal)”. Which distance does meanDistance refer to?

In my work, I have to build the distogram to train another model. As I understand it, the distogram is the distribution of distances between CB atoms.

So, is it the CB-CB distance between the two crosslinked residues, or the CA-CA distance?

Thanks a lot!

lhatsk commented 2 months ago

We trained with CA-CA distances for the distogram. meanDistance would most likely be your crosslinker's distance cutoff, e.g., 10 Å for photoAA or 25 Å for SDA.
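
As a rough sketch of how such a file could be generated, the snippet below writes one row per crosslink using the column order quoted from the README and the 25 Å SDA cutoff as meanDistance. The residue pairs, the standard deviation of 5.0, the absence of a header row, and the helper name `restraint_rows` are all assumptions for illustration; check the README for the exact conventions, including residue indexing.

```python
import csv
import io

# Hypothetical crosslinked residue pairs (residueFrom, residueTo).
crosslinks = [(12, 45), (30, 88)]


def restraint_rows(links, cutoff, stddev, distribution="normal"):
    """Build rows: residueFrom, residueTo, meanDistance, stddev, distribution."""
    return [[i, j, cutoff, stddev, distribution] for i, j in links]


# 25.0 follows the SDA cutoff mentioned above; 5.0 is a placeholder value.
rows = restraint_rows(crosslinks, cutoff=25.0, stddev=5.0)

# Write to an in-memory buffer; swap in open("restraints.csv", "w", newline="")
# to produce a file instead.
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
csv_text = buf.getvalue()
print(csv_text)
```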

sehooni commented 2 months ago

Oh, I see. :)

Then why did you choose the CA-CA distance instead of the CB-CB distance? Is there a particular reason? I ask because in AlphaFold2 the distogram is computed from the ground-truth beta carbon positions for all amino acids except glycine, where the alpha carbon is used instead.

lhatsk commented 2 months ago

Crosslinks are usually specified for CA-CA. The network will likely have no problem mapping it to CB-CB.
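
For anyone building a distogram target themselves, here is a minimal sketch of the CA-CA variant: pairwise distances between CA coordinates, then binning. The toy coordinates, the bin edges, and the function names are assumptions for illustration and are not taken from AlphaLink's training code. Passing CB coordinates (CA for glycine) instead would give the CB-based variant described above.

```python
import numpy as np


def pairwise_distances(coords):
    """Euclidean distance between every pair of rows in an (N, 3) array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def distogram_bins(dist, edges):
    """Map each distance to the index of the bin it falls into."""
    return np.digitize(dist, edges)


# Toy CA coordinates for three residues spaced 3.8 Å apart along x.
ca = np.array([[0.0, 0.0, 0.0],
               [3.8, 0.0, 0.0],
               [7.6, 0.0, 0.0]])

dist = pairwise_distances(ca)

# Hypothetical bin edges; use whatever binning your model expects.
edges = np.linspace(2.0, 22.0, 63)
bins = distogram_bins(dist, edges)

print(np.round(dist, 1))
print(bins.shape)  # (3, 3)
```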